What multimodal foundation
models cannot perceive
Prof. dr. Cees Snoek
University of Amsterdam
Head of Video & Image Sense lab
Scientific Director Amsterdam AI
Jellyfish
Cambrian explosion
Human vision consumes 50% of brain power
Van Essen, Science 1992
Human invention of written language
Source: Wikipedia
Human invention of ChatGPT
OpenAI, 11/2022
Vision and language even more powerful
1. Collect millions of images and their descriptions from the Internet
2. Learn associations between image and text
3. Generate text based on image, and vice versa
CLIP, 7/2021
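As a rough illustration of step 2, here is a minimal sketch of a CLIP-style symmetric image-text contrastive loss (not the original CLIP code; the temperature value and function name are illustrative):

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss (illustration only,
# not the original CLIP implementation).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Normalize both modalities so the dot product is cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarities between all images and all texts in the batch.
    logits = image_features @ text_features.t() / temperature

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```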
What works well in vision and language?
Flamingo, 11/2022
What works well in vision and language?
BLIP-2, 6/2023
This talk
Looks into what multimodal foundation models cannot perceive:
1. Scarcity
2. Space
3. Time
4. Human values
1. Scarcity
Work in progress with Yunhua Zhang & Hazel Doughty.
Low-Resource Natural Language Processing
No previous work on low-resource vision tasks.
Hedderich et al. ArXiv 2020
High-resource vs. Low-resource
Circuit diagram classification
Historic map retrieval
Mechanical drawing retrieval
Low-Resource Image Transfer Evaluation
Task | Formulation | Train | Val | Test
Circuit Diagram Classification | Image Classification | 154 | 100 | 1,078
Historic Map Retrieval | Image-to-Image Retrieval | 102 | 140 | 409
Mechanical Drawing Retrieval | Image-to-Image Retrieval | 300 | 100 | 754
Number of images (or image pairs) per split.
Poor performance for low-resource vision challenges
[Chart: zero-shot performance of SAM, BLIP, CLIP and ImageBind on circuit diagram classification, historic map retrieval and mechanical drawing retrieval; all models perform poorly.]
Low-Resource Vision Challenges
Our goal: adapt foundation models, pre-trained on large-scale
datasets, to low-resource tasks.
Challenge I: Data Scarcity
Challenge II: Fine-Grained
Challenge III: Specialized Domain
Baseline I: Generated Data for Data Scarcity
Baseline II: Tokenization for Fine-Grained
Baseline III: Attention for Specialized Domains
Baseline I: Generated Data for Data Scarcity
We generate images close to the input image that preserve the label, as well as more diverse images that break the label.
[Diagram: the low-resource image is passed through a generative model twice, producing a label-preserving augmentation (task loss) and a label-breaking augmentation (label-breaking loss).]
$L = L_{\text{task}} + \lambda\, L_{\text{label-breaking}}$
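A minimal sketch of how this combined objective could be wired up; the exact task and label-breaking terms in our baseline may differ, and `encoder`, `classifier` and `lam` are illustrative placeholders:

```python
# Hedged sketch of Baseline I's combined objective; the exact label-breaking
# loss may differ from the paper. `encoder` and `classifier` are placeholders.
import torch
import torch.nn.functional as F

def baseline1_loss(encoder, classifier, image, preserved_aug, broken_aug, label, lam=0.1):
    # Task loss on the original and the label-preserving generated images.
    logits = classifier(encoder(image))
    logits_preserved = classifier(encoder(preserved_aug))
    l_task = F.cross_entropy(logits, label) + F.cross_entropy(logits_preserved, label)

    # Label-breaking images should NOT be predicted as the original label:
    # penalize the probability assigned to that label (one possible choice).
    probs_broken = classifier(encoder(broken_aug)).softmax(dim=-1)
    l_label_breaking = probs_broken.gather(1, label.unsqueeze(1)).mean()

    return l_task + lam * l_label_breaking
```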
Circuit diagram examples
[Examples: an original FM Transmitter circuit diagram with its label-preserving and label-breaking augmentations.]
Baseline II: Tokenization for Fine-Grained
As we have limited data, we cannot train a tokenization layer from scratch.
Instead, we divide the linear projection kernel into sub-kernels for image patches, then create patch-level features with a learned weighting.
[Diagram: the original kernel is divided into sub-kernels k_1, k_2, ..., k_N; feature tokens are formed as a combination weighted by learned weights w_1, w_2, ..., w_N.]
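One possible reading of this sub-kernel tokenization, as a hedged sketch; the way the baseline actually divides the kernel and learns the weighting may differ, and `SubKernelTokenizer` and `num_sub` are illustrative names:

```python
# Hedged sketch of Baseline II: reuse a frozen, pre-trained patch-embedding
# kernel by splitting it into sub-kernels and recombining them with learned weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubKernelTokenizer(nn.Module):
    def __init__(self, pretrained_proj: nn.Conv2d, num_sub: int = 4):
        super().__init__()
        self.proj = pretrained_proj                     # frozen ViT patch projection
        self.proj.requires_grad_(False)
        self.num_sub = num_sub
        # One learned weight per sub-kernel, initialised to a uniform average.
        self.weights = nn.Parameter(torch.ones(num_sub) / num_sub)

    def forward(self, x):                               # x: (B, 3, H, W)
        k = self.proj.weight                            # (dim, 3, P, P)
        p = k.shape[-1]
        s = p // self.num_sub                           # split along one spatial axis
        tokens = 0
        for i in range(self.num_sub):
            # Keep only one spatial slice of the original kernel.
            sub_k = torch.zeros_like(k)
            sub_k[:, :, i * s:(i + 1) * s, :] = k[:, :, i * s:(i + 1) * s, :]
            tokens = tokens + self.weights[i] * F.conv2d(x, sub_k, stride=self.proj.stride)
        if self.proj.bias is not None:
            tokens = tokens + self.proj.bias.view(1, -1, 1, 1)
        return tokens.flatten(2).transpose(1, 2)        # (B, num_patches, dim)
```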
Baseline III: Attention for Specialized Domains
1. Learn global attention maps with common patterns particular to the specialized domain.
2. For each token, crop its region from the global attention map.
3. Combine with multi-head self-attention.
[Diagram: per-token cropped maps from the global attention map are combined with the feature tokens via the attention for the specialized domain.]
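A hedged sketch of one way such a domain-wide attention prior could be combined with multi-head self-attention, here as an additive attention bias cropped around each token's position; the baseline's exact combination may differ, and all names and sizes are illustrative:

```python
# Hedged sketch of Baseline III: a learned, domain-wide attention map that each
# token crops around its own position, mixed into multi-head self-attention.
import torch
import torch.nn as nn

class DomainAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, grid: int = 14):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Global attention map shared by all images of the specialized domain.
        self.global_map = nn.Parameter(torch.zeros(2 * grid - 1, 2 * grid - 1))
        self.grid = grid
        self.mix = nn.Parameter(torch.tensor(0.5))       # learned mixing weight

    def cropped_bias(self):
        g = self.grid
        coords = torch.arange(g)
        yy, xx = torch.meshgrid(coords, coords, indexing="ij")
        pos = torch.stack([yy.flatten(), xx.flatten()], dim=1)        # (N, 2)
        # For each token, look up the window of the global map centred on it.
        rel = pos[:, None, :] - pos[None, :, :] + g - 1               # (N, N, 2)
        return self.global_map[rel[..., 0], rel[..., 1]]              # (N, N)

    def forward(self, tokens):                            # tokens: (B, N, dim)
        bias = self.mix * self.cropped_bias()             # added to attention logits
        out, _ = self.mhsa(tokens, tokens, tokens, attn_mask=bias)
        return tokens + out                               # residual connection
```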
Results of baselines for the three challenges
Our baselines are effective
Effective adapter for several foundation models
[Charts: Results for Historic Map Retrieval. Recall@1 of our baselines versus zero-shot transfer for CLIP, BLIP, ImageBind and SAM; our baselines improve over zero-shot transfer for each foundation model.]
Qualitative results: easy samples
We recognize prominent patterns in low-resource data, such as the
coastline in the map of Sydney.
[Examples: model input versus groundtruth for the Power Supply and Dice circuit diagrams and for maps of Sydney, Australia and Winnipeg, Canada.]
Qualitative results: hard samples
Our predictions are overconfident, often based on one key region, such as the presence of the battery in the LED circuit.
We cannot yet generalize to rare image styles, such as the one used for the Innsbruck map.
[Examples: model input, prediction and groundtruth for circuit diagrams (Motor Driver, Bell, LED, Audio Amplifier) and maps (Innsbruck, Austria; Cuneo, Italy; Brugge and Leuven, Belgium).]
2. Space
Work in progress with Michael Dorkenwald, Nimrod Barazani & Yuki Asano.
Special-purpose object localization is very mature
w/ Kien Nguyen et al. CVPR 2022 / ICLR 2024
w/ Aritra Bhowmik et al. ICCV 2023
Can vision-language models localize objects?
Perhaps we need another type of prompt?
Can vision-language models do spatial reasoning?
Our proposal
[Diagram: a PIN (positional learnable prompt) module is attached to a frozen VLM, e.g. Flamingo, and trained on synthetic, unlabeled images.]
Data generation
Zhao et al. X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion. ICML 2023
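A hedged sketch of the copy-paste idea: paste an object cutout at a random location and keep its box as a text target, so the location can later be trained via next-word prediction. The actual X-Paste pipeline generates its assets with CLIP and Stable Diffusion; `paste_object` and the coordinate format are illustrative:

```python
# Hedged sketch of copy-paste data generation for spatial supervision
# (illustrative; not the exact X-Paste pipeline).
import random
from PIL import Image

def paste_object(background: Image.Image, cutout: Image.Image, name: str):
    bw, bh = background.size
    ow, oh = cutout.size
    x = random.randint(0, bw - ow)
    y = random.randint(0, bh - oh)
    composed = background.copy()
    composed.paste(cutout, (x, y), cutout)          # cutout carries an alpha mask
    # Location expressed as text, so it can be trained via next-word prediction.
    target = f"The {name} is at <{x},{y},{x + ow},{y + oh}>"
    return composed, target
```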
Vanilla Flamingo
Alayrac, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
1) Feed the frozen vision encoder with synthetic data
2) Provide the VLM with spatial learning capacity
3) Train on pasted object locations via next-word prediction
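A minimal sketch of what a PIN-style module could look like: a learnable spatial prompt added to the frozen vision features, with only the prompt receiving gradients while the Flamingo encoder and language model stay frozen. The real module may be parameterized differently; names and sizes are illustrative:

```python
# Hedged sketch of a PIN-style positional prompt for a frozen VLM.
import torch
import torch.nn as nn

class PIN(nn.Module):
    def __init__(self, dim: int, grid: int = 16):
        super().__init__()
        # Learnable 2D positional prompt: one vector per patch location.
        self.prompt = nn.Parameter(torch.zeros(1, grid * grid, dim))
        nn.init.trunc_normal_(self.prompt, std=0.02)

    def forward(self, patch_features):               # (B, grid*grid, dim)
        # Inject spatial information additively; the frozen VLM then decodes
        # object locations via ordinary next-word prediction.
        return patch_features + self.prompt
```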
The PIN module unlocks spatial localisation
3. Time
With Piyush Bagad & Makarand Tapaswi. In CVPR 2023.
• Foundation models: Language interface + a few (or no) training samples
The problem
What does this picture show?
“A dog running”
• Foundation models: Language interface + a few (or no) training samples
• Particularly attractive for videos given high cost
The problem
What does this video show?
“A kid eating ice-cream”
• Do video foundation models truly understand time?
The problem
“A kid eating ice-cream”
What does this video show?
• Do video foundation models truly understand time?
• Our idea for a “test of time”: ask questions that have temporal relations
The problem
“False”
The baby eats ice-cream before walking downhill? True or False?
• Synthetic benchmark
• Simple ‘true’ or ‘false’ predictions
The test of time
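A hedged sketch of how such a true/false time-order test can be scored with any video-text model: the statement counts as true when the video is more similar to the stated order than to the reversed one. `model.encode_video` and `model.encode_text` are assumed interfaces, not a specific library API:

```python
# Hedged sketch of scoring a time-order statement with a video-text model.
import torch
import torch.nn.functional as F

@torch.no_grad()
def time_order_correct(model, video, event_a: str, event_b: str) -> bool:
    v = F.normalize(model.encode_video(video), dim=-1)                     # (d,)
    stated = F.normalize(model.encode_text(f"{event_a} before {event_b}"), dim=-1)
    flipped = F.normalize(model.encode_text(f"{event_b} before {event_a}"), dim=-1)
    # "True" when the video matches the stated order better than the reverse.
    return bool((v @ stated) > (v @ flipped))
```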
• We pick a suite of seven openly available video-language models
• While excelling at the control task, they all fail at the time-order task
Existing models fail this test of time
[Chart: accuracy (%) of CLIP4Clip, CLIP2Video, CenterCLIP, VideoCLIP, Frozen in Time, VindLU and BridgeFormer on the control task versus the time-order task, with chance level indicated; all models do well on the control task but fail on time order.]
How to instil this sense of time?
• Post-pretraining: instead of training from scratch, we run another round of pre-training
How to instil this sense of time?
• Data: any dense video-captioning dataset!
How to instil this sense of time?
• Base model: we start from a pre-trained model, VideoCLIP
Xu et al, VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding, EMNLP 2021.
[Diagram: VideoCLIP architecture. Mean-pooled S3D features feed a BERT video encoder, and the caption "[CLS] Baby eats ice-cream" feeds a BERT text encoder, producing the video and sentence representations.]
How to instil this sense of time?
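A hedged sketch of how time-order post-pretraining samples could be built from a densely captioned video and contrasted against the reversed order; the actual objective adds this on top of VideoCLIP's contrastive loss and may differ in detail:

```python
# Hedged sketch of time-order post-pretraining: stitch two captioned clips,
# describe their order with "before", and treat the reversed description as a
# hard negative in a contrastive loss.
import torch
import torch.nn.functional as F

def make_time_order_sample(clip_a, caption_a, clip_b, caption_b):
    video = torch.cat([clip_a, clip_b], dim=0)           # concatenate along time
    positive = f"{caption_a} before {caption_b}"
    hard_negative = f"{caption_b} before {caption_a}"    # order reversed
    return video, positive, hard_negative

def time_order_loss(video_emb, pos_emb, neg_emb, temperature=0.07):
    # Contrast the stated order against the reversed order; in practice this is
    # combined with the base model's usual in-batch contrastive objective.
    logits = torch.stack([video_emb @ pos_emb, video_emb @ neg_emb]) / temperature
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits.unsqueeze(0), target)
```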
Experiments
4. Human values
Work in progress with the UvA Data Science Center HAVA-Lab.
What defines human-aligned foundation
models, how can they be made computable,
and what determines their societal acceptance?
How can we embed laws, societal values, and
ethics into the foundation model lifecycle?
Is there one solution for all, or do we need
specialized algorithms for each domain?
Cees Snoek, Pascal Mettes, Iris Groen, Heleen Janssen, Tobias Blanke, Marie Lindegaard, Erwin Berkhout, Stevan Rudinac, Marlies Schijven
Conclusions
Multimodal foundation models are amazing.
But they have perceptual difficulty with scarcity, space, time and human values.
Synthetic data generation and small-capacity adapters may help.
Bonus: both sustainable and responsible.
Thank you
Contact info
Prof. dr. Cees Snoek
https://ivi.fnwi.uva.nl/vislab/
@cgmsnoek {x, ellis.social}