FeatUp: A Model-Agnostic Framework for Features at Any Resolution
Kim, Sungchul
Contents
▪ Introduction
▪ FeatUp
▪ Experiments
▪ Conclusions
Introduction
Introduction
▪ Deep features often sacrifice spatial resolution for semantic quality
• ResNet-50 produces 7 × 7 deep features from a 224 × 224 pixel input (32× resolution reduction)
• Vision Transformers (ViTs) incur a significant resolution reduction
▪ FeatUp: a novel framework to improve the resolution of any vision model’s features
• inspired by 3D reconstruction frameworks like NeRF, in which multi-view consistency of low-resolution signals can
supervise the construction of high-resolution signals
Introduction
▪ Proposed method, FeatUp
• to significantly improve the spatial resolution of any model’s features, parametrized as either a fast feedforward
upsampling network or an implicit network
• includes a fast CUDA implementation of Joint Bilateral Upsampling that is orders of magnitude more efficient than existing implementations
• can be used as drop-in replacements for ordinary features to improve performance on dense prediction tasks and
model explainability
FeatUp
▪ Intuition
• Compute high-resolution features by observing multiple different “views”
• Two steps in the pipeline
Step 1. generate low-resolution feature views to refine into a single high-resolution output
Step 2. construct a consistent high-resolution feature map from these views
FeatUp
▪ Step 1
• Generate low-resolution feature views to refine into a single high-resolution output
• Perturb the input image with small pads, scales, and horizontal flips
• Apply the model to each transformed image to extract a collection of low-resolution feature maps (see the sketch below)
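A minimal sketch of this view-generation step, assuming a frozen PyTorch backbone `f`; the helper name `jittered_views` and the transform ranges are illustrative, not the paper's exact settings:

```python
import random
import torch
import torch.nn.functional as F

def jittered_views(x, f, n_views=8, max_pad=16):
    """Extract low-res feature maps from randomly jittered copies of x.

    x: (1, 3, H, W) image; f: frozen backbone returning (1, C, h, w).
    Returns the feature views and the parameters used for each one, so the
    same transforms can later be applied to the high-res features.
    """
    views, params = [], []
    for _ in range(n_views):
        pad = random.randint(0, max_pad)
        flip = random.random() < 0.5
        t = torch.flip(x, dims=[3]) if flip else x          # horizontal flip
        t = F.pad(t, (pad, pad, pad, pad), mode="reflect")  # small pad
        t = F.interpolate(t, size=x.shape[-2:], mode="bilinear",
                          align_corners=False)              # rescale back (a small zoom)
        with torch.no_grad():
            views.append(f(t))                              # low-res feature view
        params.append({"pad": pad, "flip": flip})
    return views, params
```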
FeatUp
▪ Step 2
• Construct a consistent high-resolution feature map from these views
• Track the parameters used to “jitter” each image rather than estimating them
• Apply the same transformation to our learned high-resolution features prior to downsampling
• Compare downsampled features to the true model outputs using a Gaussian likelihood loss
FeatUp
▪ Reconstruction Loss
$$\mathcal{L}_{rec} = \frac{1}{|T|} \sum_{t \in T} \left( \frac{1}{2s^2} \left\lVert f(t(x)) - \sigma_\downarrow\big(t(F_{hr})\big) \right\rVert_2^2 + \log s \right)$$
• 𝑡 ∈ 𝑇 : a collection of small transforms such as pads, zooms, crops, horizontal flips, and their compositions
• 𝑥 : an input image
• 𝑓 : model backbone
• 𝜎↓ : a learned downsampler
• 𝜎↑ : a learned upsampler
• $F_{hr} = \sigma_\uparrow(f(x), x)$ : the predicted high-res features
• $s = \mathcal{N}(f(t(x)))$ : a spatially-varying adaptive uncertainty
• 𝒩 : a small linear network
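A sketch of this loss in PyTorch, where `apply_t`, `sigma_down`, and `uncertainty_net` are hypothetical stand-ins for the jitter re-application, the learned downsampler σ↓, and the uncertainty network 𝒩:

```python
import torch

def reconstruction_loss(views, params, F_hr, sigma_down, apply_t, uncertainty_net):
    """Gaussian NLL between true low-res features and downsampled predictions."""
    loss = 0.0
    for f_t, t in zip(views, params):
        pred = sigma_down(apply_t(F_hr, t))       # σ↓(t(F_hr))
        s = uncertainty_net(f_t).clamp(min=1e-4)  # spatially varying std s
        loss = loss + ((f_t - pred) ** 2 / (2 * s ** 2) + s.log()).mean()
    return loss / len(views)
```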
FeatUp
▪ Choosing a Downsampler
• Two options:
• A fast and simple learned blur kernel
• A more flexible attention-based downsampler
FeatUp
▪ Choosing a Downsampler
• A fast and simple learned blur kernel
• Blurs the features with a learned blur kernel and can be implemented as a convolution applied independently to each channel
• Normalized to be non-negative and sum to 1 to ensure the features remain in the same space
• Cannot capture dynamic receptive fields, object salience, or other nonlinear effects → too restrictive in practice; prefer the attention-based downsampler
FeatUp
▪ Choosing a Downsampler
• A more flexible attention-based downsampler
• Uses a 1x1 convolution to predict a saliency map from the high-resolution features
• Combines this saliency map with learned spatially-invariant weight and bias kernels
• Normalizes the results to create a spatially-varying blur kernel that interpolates the features
$$\sigma_\downarrow(F_{hr})[i, j] = \mathrm{softmax}\big(w \odot \mathrm{Conv}(F_{hr}[\Omega_{ij}]) + b\big) \cdot F_{hr}[\Omega_{ij}]$$
• $F_{hr}[\Omega_{ij}]$ : a patch of high-resolution features corresponding to the $(i, j)$ location in the downsampled features
• The main hyperparameter for the downsampler is the kernel size, which should be larger for models with larger receptive fields
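A rough sketch of such an attention-based downsampler, assuming square feature maps and a kernel size no larger than the downsampling stride; the class name and defaults are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDownsampler(nn.Module):
    """1x1-conv saliency + learned weight/bias kernels, softmax-normalized
    into a spatially varying blur that interpolates the features."""

    def __init__(self, dim, kernel_size=4):
        super().__init__()
        self.k = kernel_size
        self.salience = nn.Conv2d(dim, 1, kernel_size=1)
        self.w = nn.Parameter(torch.ones(kernel_size * kernel_size))
        self.b = nn.Parameter(torch.zeros(kernel_size * kernel_size))

    def forward(self, F_hr, out_size):
        B, C, H, W = F_hr.shape
        stride = H // out_size                   # assumes k <= stride
        patches = F.unfold(F_hr, self.k, stride=stride)             # (B, C*k*k, L)
        patches = patches.view(B, C, self.k * self.k, -1)
        sal = F.unfold(self.salience(F_hr), self.k, stride=stride)  # (B, k*k, L)
        logits = self.w[None, :, None] * sal + self.b[None, :, None]
        attn = logits.softmax(dim=1)             # normalize the blur kernel
        out = (patches * attn.unsqueeze(1)).sum(dim=2)
        return out.view(B, C, out_size, out_size)
```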
FeatUp
▪ Choosing an Upsampler
• Two variants:
• “JBU” FeatUp parameterizes 𝜎↑ with a guided upsampler based on a stack of Joint Bilateral Upsamplers (JBU)
• Learns an upsampling strategy that generalizes across a corpus of images
• “Implicit” FeatUp uses an implicit network to parameterize 𝜎↑ and can yield remarkably crisp features when overfit to a single image
• Both methods are trained using the same broader architecture and loss
FeatUp
▪ Choosing an Upsampler
• Joint Bilateral Upsampler (JBU)
$$F_{hr} = \big(\mathrm{JBU}(\cdot, x) \circ \mathrm{JBU}(\cdot, x) \circ \cdots\big)\big(f(x)\big)$$
• This architecture is fast and directly incorporates high-frequency details from the input image into the upsampling process
• Generalizes the original JBU to high-dimensional signals and makes the operation learnable
$$\hat{F}_{hr}[i, j] = \frac{1}{Z} \sum_{(a, b) \in \Omega} F_{lr}[a, b] \, k_{range}\big(G[i, j], G[a, b]\big) \, k_{spatial}\big((i, j), (a, b)\big)$$
• $G$ : a high-resolution signal used as guidance for the low-resolution features $F_{lr}$
• $\Omega$ : a neighborhood of each pixel in the guidance (a 3x3 square centered at each pixel)
• $k(\cdot, \cdot)$ : a similarity kernel that measures how “close” two vectors are
• $Z$ : a normalization factor to ensure the kernel sums to 1
FeatUp
▪ Choosing an Upsampler
• Joint Bilateral Upsampler (JBU)
$$\hat{F}_{hr}[i, j] = \frac{1}{Z} \sum_{(a, b) \in \Omega} F_{lr}[a, b] \, k_{range}\big(G[i, j], G[a, b]\big) \, k_{spatial}\big((i, j), (a, b)\big)$$
• $k_{spatial}$ : a learnable Gaussian kernel on the Euclidean distance between coordinate vectors, with learnable width $\sigma_{spatial}$
$$k_{spatial}(x, y) = \exp\left(-\frac{\lVert x - y \rVert_2^2}{2\sigma_{spatial}^2}\right)$$
• $k_{range}$ : a temperature-weighted softmax applied to the inner products of an MLP applied to the guidance signal $G$
$$k_{range}\big(G[i, j], G[a, b]\big) = \mathrm{softmax}_{(a, b) \in \Omega}\left(\frac{1}{\sigma_{range}^2} \, \mathrm{MLP}(G[i, j]) \cdot \mathrm{MLP}(G[a, b])\right)$$
• $\sigma_{range}^2$ acts as the temperature
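A readable, unfused sketch of one learned JBU step under these definitions; the paper uses a fused CUDA kernel that avoids materializing the neighborhoods, so this dense version is for illustration only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedJBU(nn.Module):
    """One learned Joint Bilateral Upsampling step: combine k_spatial
    (Gaussian on pixel offsets, learnable width) and k_range (softmax over
    MLP-feature inner products) over a window Ω around each pixel."""

    def __init__(self, guidance_dim, hidden=32, radius=1):
        super().__init__()
        self.radius = radius  # Ω is a (2r+1) x (2r+1) window
        # 1x1 convs act as the per-pixel MLP on the guidance signal G
        self.mlp = nn.Sequential(nn.Conv2d(guidance_dim, hidden, 1),
                                 nn.ReLU(), nn.Conv2d(hidden, hidden, 1))
        self.log_sigma_spatial = nn.Parameter(torch.zeros(()))
        self.log_sigma_range = nn.Parameter(torch.zeros(()))

    def forward(self, F_lr, guidance):
        B, C, _, _ = F_lr.shape
        H, W = guidance.shape[-2:]
        k = 2 * self.radius + 1
        # resample low-res features to guidance resolution before filtering
        F_up = F.interpolate(F_lr, (H, W), mode="bilinear", align_corners=False)
        G = self.mlp(guidance)
        pad = [self.radius] * 4
        nbr_F = F.unfold(F.pad(F_up, pad, mode="replicate"), k).view(B, C, k * k, H * W)
        nbr_G = F.unfold(F.pad(G, pad, mode="replicate"), k).view(B, -1, k * k, H * W)
        # k_range logits: inner products of MLP features, temperature sigma_range^2
        center = G.view(B, -1, 1, H * W)
        range_logits = (center * nbr_G).sum(1) / self.log_sigma_range.exp()
        # log of k_spatial: Gaussian on the coordinate offsets within Ω
        ys, xs = torch.meshgrid(torch.arange(k), torch.arange(k), indexing="ij")
        d2 = ((ys - self.radius) ** 2 + (xs - self.radius) ** 2).float().to(F_lr.device)
        log_spatial = -d2.view(1, k * k, 1) / (2 * self.log_sigma_spatial.exp() ** 2)
        # softmax of summed logits = normalized product of both kernels (the 1/Z)
        weights = (range_logits + log_spatial).softmax(dim=1)
        out = (nbr_F * weights.unsqueeze(1)).sum(dim=2)
        return out.view(B, C, H, W)
```

In the full pipeline several such modules are stacked, as in the composition shown earlier.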
FeatUp
▪ Choosing an Upsampler
• Joint Bilateral Upsampler (JBU)
• An efficient CUDA implementation of the spatially adaptive kernel used in the JBU
FeatUp
▪ Choosing an Upsampler
• Implicit
• Draws a direct analogy with NeRF by parametrizing the high-res features of a single image with an implicit function $F_{hr} = \mathrm{MLP}(z)$
• Uses a small MLP to map image coordinates and intensities to a high-dimensional feature for the given location
• Uses Fourier features to improve the spatial resolution of our implicit representations
• Adding Fourier color features allows the network to use high-frequency color information from the original image
$$F_{hr} = \mathrm{MLP}\big(h(e_i : e_j : x, \hat{w})\big)$$
• $h(z, \hat{w})$ : the component-wise discrete Fourier transform of an input signal $z$, with a vector of frequencies $\hat{w}$
• $e_i, e_j$ : the two-dimensional pixel coordinate fields ranging in the interval [-1, 1]
• $:$ : concatenation along the channel dimension
• MLP is a small 3-layer ReLU network with dropout (p = 0.1) and layer normalization
• At test time, the pixel coordinate field can be queried to yield features $F_{hr}$ at any resolution
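A sketch of such an implicit network with Fourier-encoded coordinates and colors; the frequency schedule and hidden width are assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ImplicitFeaturizer(nn.Module):
    """MLP from Fourier features of (e_i : e_j : x) to a per-pixel feature."""

    def __init__(self, out_dim, color_dim=3, n_freqs=10, hidden=256):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs))
        in_dim = (2 + color_dim) * n_freqs * 2   # sin & cos per frequency
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(0.1), nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.1), nn.LayerNorm(hidden),
            nn.Linear(hidden, out_dim))

    def forward(self, x):
        # x: (1, 3, H, W); build coordinate fields e_i, e_j in [-1, 1].
        # Querying with a larger H, W yields features at a higher resolution.
        _, _, H, W = x.shape
        ei, ej = torch.meshgrid(torch.linspace(-1, 1, H, device=x.device),
                                torch.linspace(-1, 1, W, device=x.device),
                                indexing="ij")
        z = torch.cat([ei[None, None], ej[None, None], x], dim=1)  # e_i : e_j : x
        zf = z.unsqueeze(2) * self.freqs.view(1, 1, -1, 1, 1)      # h(z, w)
        feats = torch.cat([zf.sin(), zf.cos()], dim=2).flatten(1, 2)
        return self.mlp(feats.permute(0, 2, 3, 1))  # (1, H, W, out_dim)
```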
FeatUp
▪ Additional Method Details
• Accelerated Training with Feature Compression
• To reduce the memory footprint and further speed up the training of FeatUp’s implicit network, compress the spatially-varying
features to their top k = 128 principal components
• This operation is approximately lossless as the top 128 components explain ∼ 96% of the variance across a single image’s
features
• This improves training time by a factor of 60× for ResNet-50, reduces the memory footprint, enables larger batches, and does
not have any observable effect on learned feature quality
• When training the JBU upsampler, we sample random projection matrices in each batch to avoid computing PCA in the inner loop
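A minimal sketch of the compression step built on torch.pca_lowrank, which may differ from the authors' exact routine:

```python
import torch

def compress_features(f_lr, k=128):
    """Project (B, C, h, w) features onto their top-k principal components."""
    B, C, h, w = f_lr.shape
    flat = f_lr.permute(0, 2, 3, 1).reshape(-1, C)  # one row per spatial location
    mean = flat.mean(dim=0, keepdim=True)
    _, _, V = torch.pca_lowrank(flat - mean, q=k)   # V: (C, k) component basis
    comp = (flat - mean) @ V                        # compressed features
    return comp.view(B, h, w, k).permute(0, 3, 1, 2), V, mean
```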
• Total Variation Prior
• To avoid spurious noise in the high-resolution features, add a small ($\lambda_{tv} = 0.05$) total variation smoothness prior on the implicit
feature magnitudes
$$\mathcal{L}_{tv} = \sum_{i, j} \big(\lVert F_{hr}[i, j] \rVert - \lVert F_{hr}[i-1, j] \rVert\big)^2 + \big(\lVert F_{hr}[i, j] \rVert - \lVert F_{hr}[i, j-1] \rVert\big)^2$$
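Sketched in PyTorch, reading the prior as acting on per-location feature magnitudes as the bullet above states:

```python
import torch

def tv_prior(F_hr, lam=0.05):
    """Total-variation smoothness on feature magnitudes (lam = λ_tv)."""
    mag = F_hr.norm(dim=1)                        # (B, H, W) magnitudes
    dy = (mag[:, 1:, :] - mag[:, :-1, :]) ** 2    # vertical neighbors
    dx = (mag[:, :, 1:] - mag[:, :, :-1]) ** 2    # horizontal neighbors
    return lam * (dy.sum() + dx.sum())
```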
Experiments
▪ Comparison with several key upsampling baselines
• Bilinear upsampling
• Resize-conv
• Stride (i.e. reducing the stride of the backbone’s patch extractor)
• Large Image (i.e. using a larger input image)
• CARAFE (Wang et al., 2019)
• SAPA (Lu et al., 2022)
• FADE (Lu et al., 2022)
▪ Upsample ViT features by 16× (to the resolution of the input image) with every method except the strided and large-image baselines, which are computationally infeasible above 8× upsampling
Experiments
▪ Qualitative Comparisons
• Visualizing upsampling methods
Experiments
▪ Qualitative Comparisons
• Robustness across vision backbones
Experiments
▪ Transfer Learning for Semantic Segmentation and Depth Estimation
• Train linear probes on top of low-resolution features for both semantic segmentation and depth estimation
• Use a frozen pre-trained ViT-S/16 → Upsample the features (14x14 → 224x224) → extract maps by applying a linear layer
• Semantic segmentation
• Train a linear projection to predict the 27 coarse classes of the COCO-Stuff training set using a cross-entropy loss
• Depth prediction
• Train on pseudo-labels from the MiDaS (DPT-Hybrid) depth estimation network using their scale- and shift-invariant MSE
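A minimal sketch of the segmentation probe, where `featup` stands in for a trained upsampler; the 384 channels and 27 classes follow ViT-S/16 and COCO-Stuff's coarse labels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

probe = nn.Conv2d(384, 27, kernel_size=1)    # linear layer applied per pixel

def probe_loss(x, labels, backbone, featup):
    feats_hr = featup(backbone(x), x)        # (B, 384, 224, 224) features
    logits = probe(feats_hr)                 # per-pixel class scores
    return F.cross_entropy(logits, labels)   # labels: (B, 224, 224) class ids
```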
Experiments
▪ Class Activation Map Quality
• FeatUp features can be dropped into existing CAM analyses to yield stronger and more precise explanations
• Metrics for CAM quality: Average Drop (A.D.) and Average Increase (A.I.):
$$\mathrm{A.D.} = \frac{100}{N} \sum_{i=1}^{N} \frac{\max(0, Y_i^c - O_i^c)}{Y_i^c}, \qquad \mathrm{A.I.} = \frac{100}{N} \sum_{i=1}^{N} \mathbb{1}\big[Y_i^c < O_i^c\big]$$
• $Y_i^c$ : the classifier’s softmax output on sample $i$ for class $c$
• $O_i^c$ : the classifier’s softmax output on the CAM-masked sample $i$ for class $c$
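Given per-sample scores, both metrics reduce to a few lines; a sketch:

```python
import torch

def cam_metrics(Y, O):
    """Average Drop / Average Increase from (N,) tensors of class-c softmax
    scores on the original (Y) and CAM-masked (O) samples."""
    ad = (torch.clamp(Y - O, min=0) / Y).mean() * 100   # A.D.
    ai = (Y < O).float().mean() * 100                   # A.I.
    return ad.item(), ai.item()
```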
Experiments
▪ End-to-End Semantic Segmentation
• FeatUp not only improves the resolution of pre-trained features but can also improve models learned end-to-end
• Train SegFormer on the ADE20K dataset with the JBU upsampler
Conclusion
▪ FeatUp upsamples deep features using multi-view consistency
▪ The JBU-based upsampler imposes strong spatial priors to accurately recover lost spatial
information with a fast feedforward network based on a novel generalization of Joint Bilateral
Upsampling
▪ Implicit FeatUp can learn high quality features at arbitrary resolutions
▪ Both variants dramatically outperform a wide range of baselines across linear probe transfer
learning, model interpretability, and end-to-end semantic segmentation