Masked Self-supervised
Pre-training for Visual
Recognition
by: Jefferson Hernandez
There has been a divergence between how we do
pre-training in Vision vs NLP
NLP models are usually pre-trained using masked or autoregressive methods:
Masked language model | Autoregressive language model
Images from: Jay Alammar's blog
Instead the most successful pre-training in Vision is done using
contrastive methods
SimCLR (from Ziyan's talk)
How can we make Vision pre-training more
similar to NLP pre-training?
Masked and autoregressive methods in NLP are at heart
Denoising autoencoders
● They are a class of autoencoder that corrupts the input and asks the model to
predict the un-corrupted version (see the sketch after this list)
● For images this would mean applying geometric transformations, color
transformations, masking pixels, shuffling pixels, etc.
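As a toy illustration (our own minimal sketch, not tied to any specific paper), a denoising objective on images can look like this: corrupt the input by zeroing random pixels, then regress the original.

```python
import torch
import torch.nn as nn

# Minimal denoising-autoencoder training step (illustrative toy model).
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)

def denoising_step(images: torch.Tensor, drop_prob: float = 0.5) -> torch.Tensor:
    # Corrupt: zero out random pixels (one of many possible corruptions).
    keep = (torch.rand_like(images[:, :1]) > drop_prob).float()
    corrupted = images * keep
    # Ask the model to predict the un-corrupted input.
    recon = model(corrupted)
    return ((recon - images) ** 2).mean()

loss = denoising_step(torch.rand(8, 3, 32, 32))
loss.backward()
```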
Masked image modelling (MIM) has been done using
convolutions
The paper Context Encoders: Feature Learning by Inpainting (2016) is the
pioneer of masked image modelling, using convolutional neural networks to fill in
the masked part of an image.
CNN Encoder → CNN Decoder
But the results are very poor…
So the authors needed to add an adversarial loss (GAN) to get better visual
results, but even then fine-tuning accuracies were low by today's standards
(a sketch of the combined loss follows).
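A rough sketch of that combined objective (placeholder tensors; Context Encoders weights the reconstruction term far more heavily than the adversarial one, on the order of lam ≈ 0.999):

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def inpainting_loss(pred_region: torch.Tensor,
                    true_region: torch.Tensor,
                    disc_logits: torch.Tensor,
                    lam: float = 0.999) -> torch.Tensor:
    rec = ((pred_region - true_region) ** 2).mean()       # L2 reconstruction
    adv = bce(disc_logits, torch.ones_like(disc_logits))  # fool the discriminator
    return lam * rec + (1.0 - lam) * adv
```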
Can we do better than this?
How to tokenize images the same way as text?
The paper AN IMAGE IS WORTH 16X16 WORDS introduces the main way to
tokenize images for transformers: just split them into patches of 16 by 16 pixels
and pass them through a linear layer (see the sketch below).
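A minimal sketch of this tokenization (shapes assume a 224×224 RGB image and a 768-dim embedding; names are ours):

```python
import torch
import torch.nn as nn

patch, dim = 16, 768
proj = nn.Linear(3 * patch * patch, dim)  # shared linear layer for all patches

def patchify(imgs: torch.Tensor) -> torch.Tensor:
    # imgs: (B, 3, H, W) with H and W divisible by 16.
    B, C, H, W = imgs.shape
    x = imgs.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    return proj(x)  # (B, num_patches, dim)

tokens = patchify(torch.rand(2, 3, 224, 224))  # -> (2, 196, 768)
```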
(MAE) Masked Autoencoders Are Scalable Vision
Learners
● With the introduction of ViT, we can do masked image modelling the same
way we do mask language modelling in BERT.
● Unlike BERT, MAE uses an asymmetric design. The encoder operates only
on the visible patches (no [MASK] tokens), while a lightweight decoder
reconstructs the full signal from the latent representation and [MASK]
tokens.
MAE Architecture
1) Mask original image
2) Encode visible tokens
3) Add [M] tokens
4) Predict image
5) L2 pixel loss
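A compact sketch of the five steps above (toy modules and sizes; positional embeddings and token un-shuffling are omitted, so this is an approximation rather than the official implementation):

```python
import torch
import torch.nn as nn

enc_dim, dec_dim, patch_dim = 768, 512, 768
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(enc_dim, 8, batch_first=True), num_layers=2)
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dec_dim, 8, batch_first=True), num_layers=1)
enc_to_dec = nn.Linear(enc_dim, dec_dim)
mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
to_pixels = nn.Linear(dec_dim, patch_dim)

def mae_step(tokens, target_patches, mask_ratio=0.75):
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    # 1) Mask: shuffle patch indices, keep the first num_keep as visible.
    ids = torch.argsort(torch.rand(B, N), dim=1)
    keep_ids, mask_ids = ids[:, :num_keep], ids[:, num_keep:]
    visible = torch.gather(tokens, 1, keep_ids.unsqueeze(-1).expand(-1, -1, D))
    # 2) Encode only the visible tokens (no [M] tokens in the encoder).
    latent = encoder(visible)
    # 3) Add [M] tokens for every masked position.
    dec_in = torch.cat(
        [enc_to_dec(latent), mask_token.expand(B, N - num_keep, -1)], dim=1)
    # 4) Predict pixels with the lightweight decoder.
    pred = to_pixels(decoder(dec_in))
    # 5) L2 pixel loss, computed on the masked patches only.
    target = torch.gather(
        target_patches, 1, mask_ids.unsqueeze(-1).expand(-1, -1, patch_dim))
    return ((pred[:, num_keep:] - target) ** 2).mean()

loss = mae_step(torch.rand(2, 196, enc_dim), torch.rand(2, 196, patch_dim))
```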
Qualitative Results
Results
The authors do self-supervised pre-training on the ImageNet-1K (IN1K) training
set. Then they do supervised training to evaluate the representations with (i)
end-to-end fine-tuning or (ii) linear probing (a sketch of the latter follows).
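For reference, linear probing freezes the pre-trained encoder and trains only a linear head on its features; a minimal sketch (helper name and hyperparameters are ours):

```python
import torch
import torch.nn as nn

def build_linear_probe(encoder: nn.Module, feat_dim: int, num_classes: int = 1000):
    # Freeze every encoder parameter; only the linear head is trained.
    for p in encoder.parameters():
        p.requires_grad = False
    head = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9)
    return head, optimizer
```

End-to-end fine-tuning is the same setup with the encoder left unfrozen.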
Baseline model: ViT-Large
● ViT-Large (ViT-L/16) is the backbone in their ablation study.
● ViT-L is very big and tends to overfit.
● It is very hard to train supervised ViT-L from scratch; a good recipe with
strong regularization is needed.
We need high masking ratios
● The optimal ratios are surprisingly
high. A ratio of 75% is good for both
linear probing and fine-tuning.
● This is in contrast with BERT (15%)
and similar works in CV (20%–50%).
● For linear probing, the accuracy
increases steadily with the masking
ratio until 75% masking: the accuracy
gap is up to ∼20% (54.6% vs. 73.5%).
For fine-tuning, the results are less
sensitive to the ratio, and a wide
range of masking ratios (40–80%)
works well.
Mask Token
● If the encoder uses mask tokens, it
performs worse: its accuracy drops
by 14% in linear probing.
● By removing the mask token from
the encoder, they constrain the
encoder to always see real patches
and thus improve accuracy.
Reconstruction target
● Using pixels with per-patch normalization improves accuracy (a sketch follows
this list).
● In another variant, the authors perform PCA in the patch space and use the
largest PCA coefficients (96 here) as the target. Doing so degrades accuracy.
● The authors also compare an MAE variant that predicts tokens, the target
used in BEiT. Specifically for this variant, they use the DALLE pre-trained
dVAE as the tokenizer, following BEiT.
● The dVAE tokenizer requires one more pre-training stage, which may depend
on extra data (250M images). The dVAE encoder is a large convolutional
network (40% FLOPs of ViT-L) and adds nontrivial overhead.
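A sketch of the per-patch normalized pixel target (our reading of "pixels with normalization"; the helper name is ours):

```python
import torch

def normalized_pixel_target(patches: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # patches: (B, num_patches, patch_dim) raw pixel values per patch.
    mean = patches.mean(dim=-1, keepdim=True)
    var = patches.var(dim=-1, keepdim=True)
    # Normalize each patch by its own mean and standard deviation.
    return (patches - mean) / (var + eps).sqrt()
```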
Comparison with other self-supervised methods
Comparison with
supervised pre-training
Transfer learning experiments
● Object detection and instance segmentation
○ Mask R-CNN is fine-tuned on COCO. The ViT backbone is adapted to work with FPN.
● Semantic segmentation:
○ Experiments on ADE20K use UperNet and ViT as backbone.
Extending MAE to other modalities (Video)
Masked Autoencoders As Spatiotemporal Learners
● Basic idea: extend MAE to spatiotemporal learning
How to mask spatiotemporal data?
(a): Random sampling that is spacetime-agnostic. (b): Space-only random
sampling, broadcasted to all time steps (“tube” masking). (c): Time-only random
sampling, broadcasted to all spatial locations (“frame” masking). (d): Block-wise
sampling in spacetime, removing large regions (“cube” masking).
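A hedged sketch of two of these strategies, spacetime-agnostic random sampling (a) and tube masking (b); the tensor layout and helper names are assumptions:

```python
import torch

# tokens: (B, T, N, D) = batch, time steps, patches per frame, embedding dim.
def random_spacetime_mask(tokens: torch.Tensor, mask_ratio: float = 0.9):
    # (a) Spacetime-agnostic: sample over all T*N tokens at once.
    B, T, N, D = tokens.shape
    flat = tokens.reshape(B, T * N, D)
    num_keep = int(T * N * (1 - mask_ratio))
    ids = torch.argsort(torch.rand(B, T * N), dim=1)[:, :num_keep]
    return torch.gather(flat, 1, ids.unsqueeze(-1).expand(-1, -1, D))

def tube_mask(tokens: torch.Tensor, mask_ratio: float = 0.9):
    # (b) Space-only sampling, broadcast to all time steps ("tube").
    B, T, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    ids = torch.argsort(torch.rand(B, N), dim=1)[:, :num_keep]
    keep = ids.unsqueeze(1).expand(B, T, num_keep)
    return torch.gather(tokens, 2, keep.unsqueeze(-1).expand(-1, -1, -1, D))

visible = random_spacetime_mask(torch.rand(2, 8, 196, 768))  # keeps ~10% of tokens
```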
What is the optimal masking ratio for spatiotemporal data?
The optimum is ~90%, much higher than for images.
Qualitative Results
Influence of pre-training data
Results on the Kinetics-400 dataset
Multi-Modal MAE (Img+Text)
Masked Vision and Language Modeling for Multi-modal
Representation Learning
Basic idea: model p(img | text) and p(text | img), i.e., reconstruct the masked
tokens of each modality conditioned on the other (a sketch follows).
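A loose sketch of that two-way objective (hypothetical modules, not the paper's exact architecture; the paper's text loss is a cross-entropy over the vocabulary, which we replace with a regression stand-in to keep this short):

```python
import torch
import torch.nn as nn

def mask_tokens(x: torch.Tensor, ratio: float) -> torch.Tensor:
    # Zero out a random subset of tokens (stand-in for [MASK] embeddings).
    keep = (torch.rand(x.shape[:2], device=x.device) > ratio).float()
    return x * keep.unsqueeze(-1)

def joint_mvlm_loss(img_dec, txt_dec, img_tokens, txt_tokens):
    # p(img | text): reconstruct masked image tokens given the full text.
    img_pred = img_dec(mask_tokens(img_tokens, 0.75), txt_tokens)
    loss_img = ((img_pred - img_tokens) ** 2).mean()
    # p(text | img): reconstruct masked text tokens given the full image.
    txt_pred = txt_dec(mask_tokens(txt_tokens, 0.15), img_tokens)
    loss_txt = ((txt_pred - txt_tokens) ** 2).mean()
    return loss_img + loss_txt

# Cross-attention decoders: the target modality attends to the other one.
img_dec = nn.TransformerDecoder(nn.TransformerDecoderLayer(256, 4, batch_first=True), 1)
txt_dec = nn.TransformerDecoder(nn.TransformerDecoderLayer(256, 4, batch_first=True), 1)
loss = joint_mvlm_loss(img_dec, txt_dec, torch.rand(2, 196, 256), torch.rand(2, 32, 256))
```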
Qualitative Results
They don't show image reconstructions
Image-Text Retrieval (Finetuned)
Image-Text Retrieval (Zero-Shot)
Retrieval is done using img_features @ text_features^T
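A minimal sketch of that retrieval step (random features stand in for the model's outputs):

```python
import torch

img_features = torch.randn(1000, 512)   # one row per image
text_features = torch.randn(1000, 512)  # one row per caption

# Normalize so the dot product is cosine similarity.
img_features = img_features / img_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

sim = img_features @ text_features.T     # (num_images, num_texts)
best_text_per_image = sim.argmax(dim=1)  # image -> text retrieval
best_image_per_text = sim.argmax(dim=0)  # text -> image retrieval
```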
Visual Question Answering (VQA) and Natural Language
for Visual Reasoning (NLVR)
VQA | NLVR
Results
