🦩 Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc,
Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi,
Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian
Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Bińkowski, Ricardo Barreira,
Oriol Vinyals, Andrew Zisserman, Karén Simonyan. NeurIPS 2022
2023/05/11
Flamingo
• A Visual Language Model (VLM) from DeepMind
• Builds on a pretrained Large Language Model (LLM)
• Takes interleaved images/videos and text as input and generates text
• Released in three sizes: Flamingo-3B/9B/80B
Input Prompt Completion
This is a
chinchilla. They
are mainly found
in Chile.
This is a shiba.
They are very
popular in Japan.
This is !
a flamingo. They
are found in the
Caribbean and
South America.
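To make this concrete, here is a minimal sketch (my own illustration, not code from the paper) of how such a few-shot prompt can be assembled as plain text with <image> placeholders; the file names and variable names are hypothetical, and the model is then asked to continue the text after the final "This is".

# Assemble an interleaved few-shot prompt for a Flamingo-style VLM.
# <image> marks where each image's visual tokens will be inserted.
support_examples = [
    ("chinchilla.jpg", "This is a chinchilla. They are mainly found in Chile."),
    ("shiba.jpg", "This is a shiba. They are very popular in Japan."),
]
query_image = "flamingo.jpg"

prompt = ""
for _, caption in support_examples:
    prompt += "<image>" + caption
prompt += "<image>This is"  # the model completes the text from here

images = [path for path, _ in support_examples] + [query_image]
print(prompt)  # prints the interleaved prompt ending in "<image>This is"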
Visual Question Answering (further Figure 1 examples; the text after "→" is generated by Flamingo)
• Painting QA: "What is the title of this painting? Answer: The Hallucinogenic Toreador." "Where is this painting displayed? Answer: Louvres Museum, Paris." "What is the name of the city where this was painted? Answer:" → "Arles."
• Reading text in images: Output: "Underground"; Output: "Congress"; → Output: "Soulomes"
• Arithmetic from images: 2+1=3, 5+6=11, → 3x6=18
• Captioning: "A propaganda poster depicting a cat dressed as French emperor Napoleon holding a piece of cheese." "A pink room with a flamingo pool float." → "A portrait of Salvador Dali with a robot head."
Visual Dialog (further Figure 1 examples; the answers are generated by Flamingo)
• Video QA: "What happens to the man after hitting the ball? Answer:" → "he falls down."
• "This is a picture of two teddy bears on the moon." "What are they doing?" "They are having a conversation." "What object are they using?" "It looks like a computer." "Is this surprising?" "Yes, it is surprising." "Why is this picture surprising to you?" "I think it is surprising because teddy bears are not usually found on the moon."
• "What is the common thing about these three images?" "They are all flamingos." "What is the difference between these three images?" "The first one is a cartoon, the second one is a real flamingo, and the third one is a 3D model of a flamingo."
• "This is an apple with a sticker on it." "What does the sticker say?" "The sticker says 'iPod'." "Where is the photo taken?" "It looks like it's taken in a backyard." "Do you think it is printed or handwritten?" "It looks like it's handwritten." "What color is the sticker?" "It's white."
• "This is a cityscape. It looks like Chicago." "What makes you think this is Chicago?" "I think it's Chicago because of the Shedd Aquarium in the background." "What about this one? Which city is this and what famous landmark helped you recognise the city?" "This is Tokyo. I think it's Tokyo because of the Tokyo Tower."
Figure 1: Selected examples of inputs and outputs obtained from Flamingo-80B.
[Figure 3 diagram: interleaved visual/text data ("<image> This is a very cute dog. <image> This is") is processed into text with <image> tags; frozen ❄ Vision Encoders feed Perceiver Resamplers (trained from scratch), whose outputs condition GATED XATTN-DENSE layers (also trained from scratch) inserted between the frozen ❄ LM blocks (1st to n-th), which generate the output text "a very serious cat."]
Figure 3: Flamingo architecture overview. Flamingo is a family of visual language models (VLMs)
that take as input visual data interleaved with text and produce free-form text as output.
Perceiver Resampler
[Diagram: per-frame features from the Vision Encoder (t=0, 1, 2) receive time embeddings, are flattened into Xf, and a fixed set of learned latent queries X attends with Q=[X] and K=V=[Xf, X], followed by an FFW; this block is repeated num_layers times.]
def perceiver_resampler(
    x_f,              # The [T, S, d] visual features (T=time, S=space)
    time_embeddings,  # The [T, 1, d] time pos embeddings.
    x,                # R learned latents of shape [R, d]
    num_layers,       # Number of layers
):
    """The Perceiver Resampler model."""
    # Add the time position embeddings and flatten.
    x_f = x_f + time_embeddings
    x_f = flatten(x_f)  # [T, S, d] -> [T * S, d]
    # Apply the Perceiver Resampler layers.
    for i in range(num_layers):
        # Attention.
        x = x + attention_i(q=x, kv=concat([x_f, x]))
        # Feed forward.
        x = x + ffw_i(x)
    return x
[The design follows the Perceiver [Jaegle+, ICML2021]: a small set of learned latent queries repeatedly attends to the time-embedded, flattened visual features.]
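The pseudocode above leaves attention_i, ffw_i and flatten abstract. Below is a minimal runnable PyTorch sketch of the same idea (my own illustration with assumed hyperparameters, not the authors' implementation): learned latents cross-attend to the flattened, time-embedded visual features concatenated with the latents themselves, compressing a variable number of frame features into a fixed number of visual tokens.

import torch
import torch.nn as nn

class PerceiverResamplerSketch(nn.Module):
    def __init__(self, d=1024, num_latents=64, num_layers=6, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(0.02 * torch.randn(num_latents, d))  # R learned latents
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(d, num_heads, batch_first=True),
                "ffw": nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d)),
            })
            for _ in range(num_layers)
        ])

    def forward(self, x_f, time_embeddings):
        # x_f: [T, S, d] visual features; time_embeddings: [T, 1, d]
        x_f = (x_f + time_embeddings).flatten(0, 1).unsqueeze(0)  # [1, T*S, d]
        x = self.latents.unsqueeze(0)                             # [1, R, d]
        for layer in self.layers:
            kv = torch.cat([x_f, x], dim=1)                       # K = V = [Xf, X]
            attn_out, _ = layer["attn"](query=x, key=kv, value=kv)
            x = x + attn_out
            x = x + layer["ffw"](x)
        return x.squeeze(0)                                       # [R, d] visual tokens

# Example: 8 frames of 16x16 = 256 spatial features, d=1024 -> 64 visual tokens.
resampler = PerceiverResamplerSketch()
visual_tokens = resampler(torch.randn(8, 256, 1024), torch.randn(8, 1, 1024))
print(visual_tokens.shape)  # torch.Size([64, 1024])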
GATED XATTN-DENSE
• New layers inserted between the frozen LM blocks so the LLM can condition on visual input, without modifying the pretrained self-attention and FFW weights.
[Diagram: the language features Y pass through a gated cross-attention block (Q=[Y], K=V=[X] from the vision side) with tanh gating, then a gated FFW with tanh gating, and finally the frozen ❄ LM layer (self-attention with Q=K=V=[Y] followed by its FFW).]
def gated_xattn_dense(
    y,            # input language features
    x,            # input visual features
    alpha_xattn,  # xattn gating parameter – init at 0.
    alpha_dense,  # ffw gating parameter – init at 0.
):
    """Applies a GATED XATTN-DENSE layer."""
    # 1. Gated Cross Attention
    y = y + tanh(alpha_xattn) * attention(q=y, kv=x)
    # 2. Gated Feed Forward (dense) Layer
    y = y + tanh(alpha_dense) * ffw(y)
    # Regular self-attention + FFW on language
    y = y + frozen_attention(q=y, kv=y)
    y = y + frozen_ffw(y)
    return y  # output visually informed language features
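Note that alpha_xattn and alpha_dense are initialised at 0, so tanh(0) = 0 and each GATED XATTN-DENSE layer initially acts as the identity around the frozen LM; the gates then open gradually during training, which Figure 6 later visualises. A tiny sketch of this initial behaviour (illustrative only, not the paper's code):

import torch

d = 8
y = torch.randn(1, 4, d)                # language features
alpha_xattn = torch.zeros(1)            # gating parameter, initialised at 0
cross_attn_out = torch.randn(1, 4, d)   # stand-in for attention(q=y, kv=x)

y_new = y + torch.tanh(alpha_xattn) * cross_attn_out
print(torch.allclose(y_new, y))         # True: the gated block is a no-op at init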
Frozen pretrained components
Vision Encoder
• Normalizer-Free ResNet [Brock+, arXiv2021]
• Pretrained with a contrastive image-text objective in the spirit of ALIGN [Jia+, ICML2021] and CLIP [Radford+, ICML2021], then kept frozen
LLM
• Chinchilla [Hoffmann+, arXiv2022], a Transformer LLM
• The 1.4B/7B/70B Chinchilla models are used frozen, giving Flamingo-3B/9B/80B (a sketch of this freezing setup follows the diagram below)
[Architecture diagram repeated from Figure 3: frozen ❄ Vision Encoder and LM blocks; Perceiver Resampler and GATED XATTN-DENSE layers trained from scratch; interleaved "<image> This is a very cute dog. <image> This is" mapped to the output text "a very serious cat."]
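Only the Perceiver Resampler and the GATED XATTN-DENSE layers are updated during Flamingo training; the vision encoder and the Chinchilla LM blocks keep their pretrained weights. A hedged PyTorch-style sketch of that setup (the modules below are placeholders, not the actual components, and the learning rate is arbitrary):

import torch
import torch.nn as nn

def freeze(module):
    """Disable gradients so the module keeps its pretrained weights."""
    for p in module.parameters():
        p.requires_grad = False
    return module

# Placeholder modules standing in for the real components.
vision_encoder = freeze(nn.Linear(1024, 1024))   # pretrained, frozen ❄
lm_blocks = freeze(nn.Linear(1024, 1024))        # pretrained Chinchilla blocks, frozen ❄
perceiver_resampler = nn.Linear(1024, 1024)      # trained from scratch
gated_xattn_dense = nn.Linear(1024, 1024)        # trained from scratch

trainable = (list(perceiver_resampler.parameters())
             + list(gated_xattn_dense.parameters()))
optimizer = torch.optim.AdamW(trainable, lr=1e-4)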
Flamingo training data and objective
• ALIGN: 1.8 billion image-text pairs
• Datasets collected for Flamingo:
• M3W [Rae+, arXiv2021]: interleaved image-text data from 43 million HTML pages
• LTIP: 12 million image-text pairs
• VTP: 27 million video-text pairs
• The model is trained to minimise a weighted sum of per-dataset negative log-likelihoods:
$\sum_{m=1}^{M} \lambda_m \, \mathbb{E}_{(x,y)\sim\mathcal{D}_m}\!\left[ -\sum_{\ell=1}^{L} \log p\!\left( y_\ell \mid y_{<\ell},\, x_{\leq \ell} \right) \right]$
where $y_\ell$ is the $\ell$-th text token, $x_{\leq \ell}$ are the images/videos preceding it, and $\lambda_m$ is the weight of the $m$-th dataset $\mathcal{D}_m$.
Evaluation benchmarks
• VQA: TextVQA [Singh+, CVPR2019], NextQA [Xiao+, CVPR2021]
• Visual Dialog: VisDial [Das+, CVPR2017]
• Vision-and-text classification: HatefulMemes [Kiela+, NeurIPS2020]
• 11 benchmarks in total are held out from design decisions (16 image/video benchmarks overall)
Evaluation protocol (number of shots)
• Flamingo-3B/9B/80B are evaluated zero-shot and few-shot with in-context examples, without any weight updates
• Fine-tuning is also reported for comparison
• During fine-tuning, the Vision Encoder is also trained, at a higher image resolution
[Figure 1 thumbnails repeated: the chinchilla/shiba/flamingo prompt-completion example and the painting VQA examples.]
[Slide shows the paper's table of model architectures; caption fragment: the hidden size of each feed-forward MLP is 4D. L: number of layers, D: transformer hidden size, H: number of heads, Act.: FFW activation, Sq. ReLU: Squared ReLU [104].]
Zero/Few-shot results
• Comparison with prior zero-shot/few-shot SoTA across 16 image (I) and video (V) benchmarks
(The Flamingo rows use no fine-tuning; shots are the number of in-context examples, with the number of shots or the size of the task-specific training set given in parentheses for the SOTA columns.)

Benchmark                | Zero/Few-shot SOTA | Flamingo-3B 0/4/32 shots | Flamingo-9B 0/4/32 shots | Flamingo-80B 0/4/32 shots | Fine-tuned SOTA
OKVQA (I)                | 43.3 (16) [34]     | 41.2 / 43.3 / 45.9       | 44.7 / 49.3 / 51.0       | 50.6 / 57.4 / 57.8        | 54.4 [34] (10K)
VQAv2 (I)                | 38.2 (4) [114]     | 49.2 / 53.2 / 57.1       | 51.8 / 56.3 / 60.4       | 56.3 / 63.1 / 67.6        | 80.2 [140] (444K)
COCO (I)                 | 32.2 (0) [124]     | 73.0 / 85.0 / 99.0       | 79.4 / 93.1 / 106.3      | 84.3 / 103.2 / 113.8      | 143.3 [124] (500K)
MSVDQA (V)               | 35.2 (0) [58]      | 27.5 / 33.0 / 42.6       | 30.2 / 36.2 / 47.2       | 35.6 / 41.7 / 52.3        | 47.9 [28] (27K)
VATEX (V)                | -                  | 40.1 / 50.0 / 59.2       | 39.5 / 51.7 / 57.4       | 46.7 / 56.0 / 65.1        | 76.3 [153] (500K)
VizWiz (I)               | -                  | 28.9 / 34.0 / 45.5       | 28.8 / 34.9 / 44.0       | 31.6 / 39.6 / 49.8        | 57.2 [65] (20K)
Flickr30K (I)            | -                  | 60.6 / 72.0 / 71.2       | 61.5 / 72.6 / 72.8       | 67.2 / 75.1 / 75.4        | 67.4 [150] (30K)
MSRVTTQA (V)             | 19.2 (0) [58]      | 11.0 / 14.9 / 25.6       | 13.7 / 18.2 / 29.4       | 17.4 / 23.9 / 31.0        | 46.8 [51] (130K)
iVQA (V)                 | 12.2 (0) [135]     | 32.7 / 35.7 / 37.7       | 35.2 / 37.7 / 40.7       | 40.7 / 44.1 / 45.3        | 35.4 [135] (6K)
YouCook2 (V)             | -                  | 55.8 / 64.6 / 76.7       | 55.0 / 70.8 / 77.3       | 60.1 / 74.5 / 86.8        | 138.7 [132] (10K)
STAR (V)                 | 39.4 (0) [143]     | 39.6 / 41.3 / 41.6       | 41.8 / 42.8 / 41.2       | 39.7 / 42.4 / 42.2        | 36.7 [128] (46K)
VisDial (I)              | 11.6 (0) [79]      | 46.1 / 47.3 / 47.3       | 48.0 / 50.4 / 50.4       | 52.0 / 55.6 / 55.6        | 75.2 [79] (123K)
TextVQA (I)              | -                  | 30.1 / 32.7 / 30.6       | 31.8 / 33.6 / 32.6       | 35.0 / 36.5 / 37.9        | 54.7 [137] (20K)
NextQA (I)               | -                  | 21.3 / 22.4 / 26.1       | 23.0 / 24.7 / 28.4       | 26.7 / 30.8 / 33.5        | 25.2 [129] (38K)
HatefulMemes (I)         | 66.1 (0) [85]      | 53.7 / 53.6 / 56.3       | 57.0 / 62.7 / 63.5       | 46.4 / 68.6 / 70.0        | 79.1 [62] (9K)
RareAct (V)              | 40.7 (0) [85]      | 58.4 / - / -             | 57.9 / - / -             | 60.8 / - / -              | -
Table 1: Comparison to the state of the art. A single Flamingo model reaches the state of the art
on a wide array of image (I) and video (V) understanding tasks with few-shot learning, significantly
outperforming the previous zero-shot and few-shot state of the art with as few as four examples.
Few-shot Flamingo vs. fine-tuned models
• Few-shot Flamingo surpasses fine-tuned SoTA on 6 of the 16 benchmarks
Figure 2: Flamingo results overview. Left: Our largest model, dubbed Flamingo, outperforms state-of-the-art fine-tuned models on 6 of the 16 tasks we consider, without any fine-tuning.
Summary
• Flamingo: a Visual Language Model scaled up to 80B parameters
• Reuses a frozen pretrained LLM (Chinchilla) and a frozen vision encoder
• Strong zero-shot and few-shot performance on a wide range of image and video benchmarks
Figure 6: Evolution of the absolute value of the tanh gating at different layers of Flamingo-3B. (a) Attention tanh gating; (b) FFW tanh gating.
[Figure 7 diagram: the input webpage "Cute pics of my pets! / My puppy sitting in the grass. / My cat looking very dignified." is turned into the processed text "<BOS>Cute pics of my pets!<EOC><image>My puppy sitting in the grass.<EOC><image>My cat looking very dignified.<EOC>" (with <image> tags and special tokens inserted), then tokenised. Image 1 and Image 2 are each processed by a Vision Encoder and Perceiver Resampler, and the text queries Q cross-attend to K=V=[X] under a mask. The per-token image index row reads: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2.]
Figure 7: Interleaved visual data and text support. Given text interleaved with images/videos,
e.g. coming from a webpage, we first process the text by inserting <image> tags at the locations of
the visual data in the text as well as special tokens (<BOS> for "beginning of sequence" or <EOC> for
"end of chunk"). Images are processed independently by the Vision Encoder and Perceiver Resampler
to extract visual tokens. At a given text token, the model only cross-attends to the visual tokens
corresponding to the last preceding image/video. The image index row indicates which image/video a text token
can attend to, or 0 when no image/video is preceding. In practice, this selective cross-attention is achieved
through masking, illustrated here with the dark blue entries (unmasked/visible) and light blue entries (masked).
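The per-token image indices in Figure 7 (0 for tokens before any image, then 1, 2, ... after each <image>) determine the cross-attention mask: a text token may only attend to the visual tokens of the last preceding image. A small sketch of how such indices and the resulting mask could be computed (my own illustration, not the paper's code):

def image_indices(tokens):
    """For each text token, the index of the last preceding <image> (0 if none)."""
    idx, current = [], 0
    for tok in tokens:
        if tok == "<image>":
            current += 1
        idx.append(current)
    return idx

def xattn_mask(token_image_idx, num_images, tokens_per_image):
    """True where a text token may attend to a visual token (last preceding image only)."""
    mask = []
    for phi in token_image_idx:
        # phi == 0 means no preceding image, so the whole row stays False (masked).
        row = [(img == phi - 1) for img in range(num_images) for _ in range(tokens_per_image)]
        mask.append(row)
    return mask

tokens = "<BOS> Cute pics of my pets ! <EOC> <image> My puppy sitting in the grass . <EOC>".split()
print(image_indices(tokens))
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]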
Fine-tuning
Task (split)             | Flamingo 32 shots | Flamingo fine-tuned | Previous SotA
VQAv2 (test-dev)         | 67.6              | 82.0                | 81.3† [133]
VQAv2 (test-std)         | -                 | 82.1                | 81.3† [133]
COCO (test)              | 113.8             | 138.1               | 149.6† [119]
VATEX (test)             | 65.1              | 84.2                | 81.4† [153]
VizWiz (test-dev)        | 49.8              | 65.7                | 57.2† [65]
VizWiz (test-std)        | -                 | 65.4                | 60.6† [65]
MSRVTTQA (test)          | 31.0              | 47.4                | 46.8 [51]
VisDial (valid)          | 56.8              | 61.8                | 75.2 [79]
VisDial (test-std)       | -                 | 59.7                | 75.4† [123]
YouCook2 (valid)         | 86.8              | 118.6               | 138.7 [132]
TextVQA (valid)          | 36.0              | 57.1                | 54.7 [137]
TextVQA (test-std)       | -                 | 54.1                | 73.7 [84]
HatefulMemes (test seen) | 70.0              | 86.6                | 84.6† [152]
Table 2: Comparison to SotA when fine-tuning Flamingo. We fine-tune Flamingo on all nine
tasks where Flamingo does not achieve SotA with few-shot learning. Flamingo sets a new SotA on
five of them, outperforming methods (marked with †) that use tricks such as model ensembling or
domain-specific metric optimisation (e.g., CIDEr optimisation).
Ablation study on Flamingo-3B:
Ablated setting   | Original value | Changed value            | Param. count | Step time | COCO CIDEr | OKVQA top1 | VQAv2 top1 | MSVDQA top1 | VATEX CIDEr | Overall score
Baseline          | -              | -                        | 3.2B         | 1.74s     | 86.5       | 42.1       | 55.8       | 36.3        | 53.4        | 70.7
(i) Training data | All data       | w/o Video-Text pairs     | 3.2B         | 1.42s     | 84.2       | 43.0       | 53.9       | 34.5        | 46.0        | 67.3
(i) Training data | All data       | w/o Image-Text pairs     | 3.2B         | 0.95s     | 66.3       | 39.2       | 51.6       | 32.0        | 41.6        | 60.9
(i) Training data | All data       | Image-Text pairs → LAION | 3.2B         | 1.74s     | 79.5       | 41.4       | 53.5       | 33.9        | 47.6        | 66.4
(i) Training data | All data       | w/o M3W                  | 3.2B         | 1.02s     | 54.1       | 36.5       | 52.7       | 31.4        | 23.5        | 53.4
(ii) Optimisation | Accumulation   | Round Robin              | 3.2B         | 1.68s     | 76.1       | 39.8       | 52.1       | 33.2        | 40.8        | 62.9
(iii) Tanh gating | ✓              | ✗                        | 3.2B         | 1.74s     | 78.4       | 40.5       | 52.9       | 35.9        | 47.5        | 66.5