What multimodal foundation
models cannot perceive
Prof. dr. Cees Snoek
University of Amsterdam
Head of Video & Image Sense lab
Scientific Director Amsterdam AI
Jellyfish
Cambrian explosion
Human vision consumes 50% of brain power
Van Essen, Science 1992
Human invention of written language
Source: Wikipedia
Human invention of ChatGPT
OpenAI, 11/2022
Vision and language even more powerful
1. Collect millions of images and their descriptions from the Internet
2. Learn associations between image and text
3. Generate text based on image, and vice versa
CLIP, 7/2021
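As a rough illustration of step 2, here is a minimal sketch of a CLIP-style symmetric image-text contrastive loss (not the original CLIP code; the temperature value and function name are illustrative):

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss (illustration only,
# not the original CLIP implementation).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Normalize both modalities so the dot product is cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarities between all images and all texts in the batch.
    logits = image_features @ text_features.t() / temperature

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```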
What works well in vision and language?
Flamingo, 11/2022
What works well in vision and language?
BLIP-2, 6/2023
This talk
Looks into what multimodal foundation models cannot perceive:
1. Scarcity
2. Space
3. Time
4. Human values
1. Scarcity
Work in progress with Yunhua Zhang & Hazel Doughty.
Low-Resource Natural Language Processing
No previous work on low-resource vision tasks.
Hedderich et al. ArXiv 2020
High-resource vs. Low-resource
Circuit diagram classification
Historic map retrieval
Mechanical drawing retrieval
Low-Resource Image Transfer Evaluation
Task | Formulation | Train | Val | Test
Circuit Diagram Classification | Image Classification | 154 | 100 | 1,078
Historic Map Retrieval | Image-to-Image Retrieval | 102 | 140 | 409
Mechanical Drawing Retrieval | Image-to-Image Retrieval | 300 | 100 | 754
Number of images (or image pairs) per split.
Poor performance for low-resource vision challenges
[Chart: zero-shot performance of SAM, BLIP, CLIP and ImageBind on circuit diagram classification, historic map retrieval and mechanical drawing retrieval; all models perform poorly.]
Low-Resource Vision Challenges
Our goal: adapt foundation models, pre-trained on large-scale
datasets, to low-resource tasks.
Challenge I: Data Scarcity
Challenge II: Fine-Grained
Challenge III: Specialized Domain
Baseline I: Generated Data for Data Scarcity
Baseline II: Tokenization for Fine-Grained
Baseline III: Attention for Specialized Domains
Baseline I: Generated Data for Data Scarcity
We generate images close to the input image that preserve the label, as well as more diverse images that break the label.
[Diagram: the low-resource image is passed through a generative model twice, producing a label-preserving augmentation (task loss) and a label-breaking augmentation (label-breaking loss).]
$L = L_{\text{task}} + \lambda\, L_{\text{label-breaking}}$
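A minimal sketch of how this combined objective could be wired up; the exact task and label-breaking terms in our baseline may differ, and `encoder`, `classifier` and `lam` are illustrative placeholders:

```python
# Hedged sketch of Baseline I's combined objective; the exact label-breaking
# loss may differ from the paper. `encoder` and `classifier` are placeholders.
import torch
import torch.nn.functional as F

def baseline1_loss(encoder, classifier, image, preserved_aug, broken_aug, label, lam=0.1):
    # Task loss on the original and the label-preserving generated images.
    logits = classifier(encoder(image))
    logits_preserved = classifier(encoder(preserved_aug))
    l_task = F.cross_entropy(logits, label) + F.cross_entropy(logits_preserved, label)

    # Label-breaking images should NOT be predicted as the original label:
    # penalize the probability assigned to that label (one possible choice).
    probs_broken = classifier(encoder(broken_aug)).softmax(dim=-1)
    l_label_breaking = probs_broken.gather(1, label.unsqueeze(1)).mean()

    return l_task + lam * l_label_breaking
```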
Circuit diagram examples
[Examples: an original FM Transmitter circuit diagram with its label-preserving and label-breaking augmentations.]
Baseline II: Tokenization for Fine-Grained
As we have limited data, we cannot train a tokenization layer from scratch.
Instead, we divide the linear projection kernel into sub-kernels for image patches, then create patch-level features with a learned weighting.
[Diagram: the original kernel is divided into sub-kernels k_1, k_2, ..., k_N; feature tokens are formed as a combination weighted by learned weights w_1, w_2, ..., w_N.]
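One possible reading of this sub-kernel tokenization, as a hedged sketch; the way the baseline actually divides the kernel and learns the weighting may differ, and `SubKernelTokenizer` and `num_sub` are illustrative names:

```python
# Hedged sketch of Baseline II: reuse a frozen, pre-trained patch-embedding
# kernel by splitting it into sub-kernels and recombining them with learned weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubKernelTokenizer(nn.Module):
    def __init__(self, pretrained_proj: nn.Conv2d, num_sub: int = 4):
        super().__init__()
        self.proj = pretrained_proj                     # frozen ViT patch projection
        self.proj.requires_grad_(False)
        self.num_sub = num_sub
        # One learned weight per sub-kernel, initialised to a uniform average.
        self.weights = nn.Parameter(torch.ones(num_sub) / num_sub)

    def forward(self, x):                               # x: (B, 3, H, W)
        k = self.proj.weight                            # (dim, 3, P, P)
        p = k.shape[-1]
        s = p // self.num_sub                           # split along one spatial axis
        tokens = 0
        for i in range(self.num_sub):
            # Keep only one spatial slice of the original kernel.
            sub_k = torch.zeros_like(k)
            sub_k[:, :, i * s:(i + 1) * s, :] = k[:, :, i * s:(i + 1) * s, :]
            tokens = tokens + self.weights[i] * F.conv2d(x, sub_k, stride=self.proj.stride)
        if self.proj.bias is not None:
            tokens = tokens + self.proj.bias.view(1, -1, 1, 1)
        return tokens.flatten(2).transpose(1, 2)        # (B, num_patches, dim)
```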
Baseline III: Attention for Specialized Domains
1. Learn global attention maps with common patterns particular to the specialized domain.
2. For each token, crop its region from the global attention map.
3. Combine with multi-head self-attention.
[Diagram: per-token cropped maps from the global attention map are combined with the feature tokens via the attention for the specialized domain.]
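A hedged sketch of one way such a domain-wide attention prior could be combined with multi-head self-attention, here as an additive attention bias cropped around each token's position; the baseline's exact combination may differ, and all names and sizes are illustrative:

```python
# Hedged sketch of Baseline III: a learned, domain-wide attention map that each
# token crops around its own position, mixed into multi-head self-attention.
import torch
import torch.nn as nn

class DomainAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, grid: int = 14):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Global attention map shared by all images of the specialized domain.
        self.global_map = nn.Parameter(torch.zeros(2 * grid - 1, 2 * grid - 1))
        self.grid = grid
        self.mix = nn.Parameter(torch.tensor(0.5))       # learned mixing weight

    def cropped_bias(self):
        g = self.grid
        coords = torch.arange(g)
        yy, xx = torch.meshgrid(coords, coords, indexing="ij")
        pos = torch.stack([yy.flatten(), xx.flatten()], dim=1)        # (N, 2)
        # For each token, look up the window of the global map centred on it.
        rel = pos[:, None, :] - pos[None, :, :] + g - 1               # (N, N, 2)
        return self.global_map[rel[..., 0], rel[..., 1]]              # (N, N)

    def forward(self, tokens):                            # tokens: (B, N, dim)
        bias = self.mix * self.cropped_bias()             # added to attention logits
        out, _ = self.mhsa(tokens, tokens, tokens, attn_mask=bias)
        return tokens + out                               # residual connection
```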
Results of baselines for the three challenges
Our baselines are effective
Effective adapter for several foundation models
[Charts: Results for Historic Map Retrieval. Recall@1 of our baselines versus zero-shot transfer for CLIP, BLIP, ImageBind and SAM; our baselines improve over zero-shot transfer for each foundation model.]
Qualitative results: easy samples
We recognize prominent patterns in low-resource data, such as the
coastline in the map of Sydney.
[Examples: model input versus groundtruth for the Power Supply and Dice circuit diagrams and for maps of Sydney, Australia and Winnipeg, Canada.]
Qualitative results: hard samples
Our predictions are overconfident, often based on one key region, such as the presence of the battery in the LED circuit.
We cannot yet generalize to rare image styles, such as the one used for the Innsbruck map.
[Examples: model input, prediction and groundtruth for circuit diagrams (Motor Driver, Bell, LED, Audio Amplifier) and maps (Innsbruck, Austria; Cuneo, Italy; Brugge and Leuven, Belgium).]
2. Space
Work in progress with Michael Dorkenwald, Nimrod Barazani & Yuki Asano.
Special-purpose object localization is very mature
w/ Kien Nguyen et al. CVPR 2022 / ICLR 2024
w/ Aritra Bhowmik et al. ICCV 2023
Can vision-language models localize objects?
Perhaps we need another type of prompt?
Can vision-language models do spatial reasoning?
Our proposal
[Diagram: a PIN (positional learnable prompt) module is attached to a frozen VLM, e.g. Flamingo, and trained on synthetic, unlabeled images.]
Data generation
Zhao et al. X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion. ICML 2023
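A hedged sketch of the copy-paste idea: paste an object cutout at a random location and keep its box as a text target, so the location can later be trained via next-word prediction. The actual X-Paste pipeline generates its assets with CLIP and Stable Diffusion; `paste_object` and the coordinate format are illustrative:

```python
# Hedged sketch of copy-paste data generation for spatial supervision
# (illustrative; not the exact X-Paste pipeline).
import random
from PIL import Image

def paste_object(background: Image.Image, cutout: Image.Image, name: str):
    bw, bh = background.size
    ow, oh = cutout.size
    x = random.randint(0, bw - ow)
    y = random.randint(0, bh - oh)
    composed = background.copy()
    composed.paste(cutout, (x, y), cutout)          # cutout carries an alpha mask
    # Location expressed as text, so it can be trained via next-word prediction.
    target = f"The {name} is at <{x},{y},{x + ow},{y + oh}>"
    return composed, target
```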
Vanilla Flamingo
Alayrac, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
1) Feed the frozen vision encoder with synthetic data
2) Provide the VLM with spatial learning capacity
3) Train on pasted object locations via next-word prediction
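A minimal sketch of what a PIN-style module could look like: a learnable spatial prompt added to the frozen vision features, with only the prompt receiving gradients while the Flamingo encoder and language model stay frozen. The real module may be parameterized differently; names and sizes are illustrative:

```python
# Hedged sketch of a PIN-style positional prompt for a frozen VLM.
import torch
import torch.nn as nn

class PIN(nn.Module):
    def __init__(self, dim: int, grid: int = 16):
        super().__init__()
        # Learnable 2D positional prompt: one vector per patch location.
        self.prompt = nn.Parameter(torch.zeros(1, grid * grid, dim))
        nn.init.trunc_normal_(self.prompt, std=0.02)

    def forward(self, patch_features):               # (B, grid*grid, dim)
        # Inject spatial information additively; the frozen VLM then decodes
        # object locations via ordinary next-word prediction.
        return patch_features + self.prompt
```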
The PIN module unlocks spatial localisation
3. Time
With Piyush Bagad & Makarand Tapaswi. In CVPR 2023.
• Foundation models: Language interface + a few (or no) training samples
The problem
What does this picture show?
“A dog running”
• Foundation models: Language interface + a few (or no) training samples
• Particularly attractive for videos given high cost
The problem
What does this video show?
“A kid eating ice-cream”
• Do video foundation models truly understand time?
The problem
“A kid eating ice-cream”
What does this video show?
• Do video foundation models truly understand time?
• Our idea for a “test of time”: ask questions that have temporal relations
The problem
“False”
The baby eats ice-cream before walking downhill? True or False?
• Synthetic benchmark
• Simple ‘true’ or ‘false’ predictions
The test of time
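A hedged sketch of how such a true/false time-order test can be scored with any video-text model: the statement counts as true when the video is more similar to the stated order than to the reversed one. `model.encode_video` and `model.encode_text` are assumed interfaces, not a specific library API:

```python
# Hedged sketch of scoring a time-order statement with a video-text model.
import torch
import torch.nn.functional as F

@torch.no_grad()
def time_order_correct(model, video, event_a: str, event_b: str) -> bool:
    v = F.normalize(model.encode_video(video), dim=-1)                     # (d,)
    stated = F.normalize(model.encode_text(f"{event_a} before {event_b}"), dim=-1)
    flipped = F.normalize(model.encode_text(f"{event_b} before {event_a}"), dim=-1)
    # "True" when the video matches the stated order better than the reverse.
    return bool((v @ stated) > (v @ flipped))
```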
• We pick a suite of seven openly available video-language models
• While excelling at the control task, they all fail at the time-order task
Existing models fail this test of time
[Chart: accuracy (%) of CLIP4Clip, CLIP2Video, CenterCLIP, VideoCLIP, Frozen in Time, VindLU and BridgeFormer on the control task versus the time-order task, with chance level indicated; all models do well on the control task but fail on time order.]
How to instil this sense of time?
• Post-pretraining: instead of training from scratch, we run another round of pre-training
How to instil this sense of time?
• Data: any dense video-captioning dataset!
How to instil this sense of time?
• Base model: we start from a pre-trained model, VideoCLIP
Xu et al, VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding, EMNLP 2021.
[Diagram: VideoCLIP architecture. Mean-pooled S3D features feed a BERT video encoder, and the caption "[CLS] Baby eats ice-cream" feeds a BERT text encoder, producing the video and sentence representations.]
How to instil this sense of time?
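A hedged sketch of how time-order post-pretraining samples could be built from a densely captioned video and contrasted against the reversed order; the actual objective adds this on top of VideoCLIP's contrastive loss and may differ in detail:

```python
# Hedged sketch of time-order post-pretraining: stitch two captioned clips,
# describe their order with "before", and treat the reversed description as a
# hard negative in a contrastive loss.
import torch
import torch.nn.functional as F

def make_time_order_sample(clip_a, caption_a, clip_b, caption_b):
    video = torch.cat([clip_a, clip_b], dim=0)           # concatenate along time
    positive = f"{caption_a} before {caption_b}"
    hard_negative = f"{caption_b} before {caption_a}"    # order reversed
    return video, positive, hard_negative

def time_order_loss(video_emb, pos_emb, neg_emb, temperature=0.07):
    # Contrast the stated order against the reversed order; in practice this is
    # combined with the base model's usual in-batch contrastive objective.
    logits = torch.stack([video_emb @ pos_emb, video_emb @ neg_emb]) / temperature
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits.unsqueeze(0), target)
```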
Experiments
4. Human values
Work in progress with the UvA Data Science Center HAVA-Lab.
What defines human-aligned foundation
models, how can they be made computable,
and what determines their societal acceptance?
How can we embed laws, societal values, and
ethics into the foundation model lifecycle?
Is there one solution for all, or do we need
specialized algorithms for each domain?
Cees Snoek, Pascal Mettes, Iris Groen, Heleen Janssen, Tobias Blanke, Marie Lindegaard, Erwin Berkhout, Stevan Rudinac, Marlies Schijven
Conclusions
Multimodal foundation models are amazing.
But they have perceptual difficulty with scarcity, space, time and human values.
Synthetic data generation and small-capacity adapters may help.
Bonus: both sustainable and responsible.
Thank you
Contact info
Prof. dr. Cees Snoek
https://ivi.fnwi.uva.nl/vislab/
@cgmsnoek {x, ellis.social}