🦩 Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc,
Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi,
Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian
Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Bińkowski, Ricardo Barreira,
Oriol Vinyals, Andrew Zisserman, Karén Simonyan. NeurIPS 2022
2023/05/11
Flamingo
• A Visual Language Model (VLM) from DeepMind
• Builds on a pretrained Large Language Model (LLM)
• Takes interleaved images/videos and text as input and generates text
• Released in three sizes: Flamingo-3B/9B/80B
Input Prompt Completion
This is a
chinchilla. They
are mainly found
in Chile.
This is a shiba.
They are very
popular in Japan.
This is !
a flamingo. They
are found in the
Caribbean and
South America.
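To make this concrete, here is a minimal sketch (my own illustration, not code from the paper) of how such a few-shot prompt can be assembled as plain text with <image> placeholders; the file names and variable names are hypothetical, and the model is then asked to continue the text after the final "This is".

# Assemble an interleaved few-shot prompt for a Flamingo-style VLM.
# <image> marks where each image's visual tokens will be inserted.
support_examples = [
    ("chinchilla.jpg", "This is a chinchilla. They are mainly found in Chile."),
    ("shiba.jpg", "This is a shiba. They are very popular in Japan."),
]
query_image = "flamingo.jpg"

prompt = ""
for _, caption in support_examples:
    prompt += "<image>" + caption
prompt += "<image>This is"  # the model completes the text from here

images = [path for path, _ in support_examples] + [query_image]
print(prompt)  # prints the interleaved prompt ending in "<image>This is"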
Visual Question Answering (further Figure 1 examples; the text after "→" is generated by Flamingo)
• Painting QA: "What is the title of this painting? Answer: The Hallucinogenic Toreador." "Where is this painting displayed? Answer: Louvres Museum, Paris." "What is the name of the city where this was painted? Answer:" → "Arles."
• Reading text in images: Output: "Underground"; Output: "Congress"; → Output: "Soulomes"
• Arithmetic from images: 2+1=3, 5+6=11, → 3x6=18
• Captioning: "A propaganda poster depicting a cat dressed as French emperor Napoleon holding a piece of cheese." "A pink room with a flamingo pool float." → "A portrait of Salvador Dali with a robot head."
Visual Dialog (further Figure 1 examples; the answers are generated by Flamingo)
• Video QA: "What happens to the man after hitting the ball? Answer:" → "he falls down."
• "This is a picture of two teddy bears on the moon." "What are they doing?" "They are having a conversation." "What object are they using?" "It looks like a computer." "Is this surprising?" "Yes, it is surprising." "Why is this picture surprising to you?" "I think it is surprising because teddy bears are not usually found on the moon."
• "What is the common thing about these three images?" "They are all flamingos." "What is the difference between these three images?" "The first one is a cartoon, the second one is a real flamingo, and the third one is a 3D model of a flamingo."
• "This is an apple with a sticker on it." "What does the sticker say?" "The sticker says 'iPod'." "Where is the photo taken?" "It looks like it's taken in a backyard." "Do you think it is printed or handwritten?" "It looks like it's handwritten." "What color is the sticker?" "It's white."
• "This is a cityscape. It looks like Chicago." "What makes you think this is Chicago?" "I think it's Chicago because of the Shedd Aquarium in the background." "What about this one? Which city is this and what famous landmark helped you recognise the city?" "This is Tokyo. I think it's Tokyo because of the Tokyo Tower."
Figure 1: Selected examples of inputs and outputs obtained from Flamingo-80B.
[Figure 3 diagram: interleaved visual/text data ("<image> This is a very cute dog. <image> This is") is processed into text with <image> tags; frozen ❄ Vision Encoders feed Perceiver Resamplers (trained from scratch), whose outputs condition GATED XATTN-DENSE layers (also trained from scratch) inserted between the frozen ❄ LM blocks (1st to n-th), which generate the output text "a very serious cat."]
Figure 3: Flamingo architecture overview. Flamingo is a family of visual language models (VLMs)
that take as input visual data interleaved with text and produce free-form text as output.
Perceiver Resampler
[Diagram: per-frame features from the Vision Encoder (t=0, 1, 2) receive time embeddings, are flattened into Xf, and a fixed set of learned latent queries X attends with Q=[X] and K=V=[Xf, X], followed by an FFW; this block is repeated num_layers times.]
def perceiver_resampler(
    x_f,              # The [T, S, d] visual features (T=time, S=space)
    time_embeddings,  # The [T, 1, d] time pos embeddings.
    x,                # R learned latents of shape [R, d]
    num_layers,       # Number of layers
):
    """The Perceiver Resampler model."""
    # Add the time position embeddings and flatten.
    x_f = x_f + time_embeddings
    x_f = flatten(x_f)  # [T, S, d] -> [T * S, d]
    # Apply the Perceiver Resampler layers.
    for i in range(num_layers):
        # Attention.
        x = x + attention_i(q=x, kv=concat([x_f, x]))
        # Feed forward.
        x = x + ffw_i(x)
    return x
[The design follows the Perceiver [Jaegle+, ICML2021]: a small set of learned latent queries repeatedly attends to the time-embedded, flattened visual features.]
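The pseudocode above leaves attention_i, ffw_i and flatten abstract. Below is a minimal runnable PyTorch sketch of the same idea (my own illustration with assumed hyperparameters, not the authors' implementation): learned latents cross-attend to the flattened, time-embedded visual features concatenated with the latents themselves, compressing a variable number of frame features into a fixed number of visual tokens.

import torch
import torch.nn as nn

class PerceiverResamplerSketch(nn.Module):
    def __init__(self, d=1024, num_latents=64, num_layers=6, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(0.02 * torch.randn(num_latents, d))  # R learned latents
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(d, num_heads, batch_first=True),
                "ffw": nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d)),
            })
            for _ in range(num_layers)
        ])

    def forward(self, x_f, time_embeddings):
        # x_f: [T, S, d] visual features; time_embeddings: [T, 1, d]
        x_f = (x_f + time_embeddings).flatten(0, 1).unsqueeze(0)  # [1, T*S, d]
        x = self.latents.unsqueeze(0)                             # [1, R, d]
        for layer in self.layers:
            kv = torch.cat([x_f, x], dim=1)                       # K = V = [Xf, X]
            attn_out, _ = layer["attn"](query=x, key=kv, value=kv)
            x = x + attn_out
            x = x + layer["ffw"](x)
        return x.squeeze(0)                                       # [R, d] visual tokens

# Example: 8 frames of 16x16 = 256 spatial features, d=1024 -> 64 visual tokens.
resampler = PerceiverResamplerSketch()
visual_tokens = resampler(torch.randn(8, 256, 1024), torch.randn(8, 1, 1024))
print(visual_tokens.shape)  # torch.Size([64, 1024])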
GATED XATTN-DENSE
• New layers inserted between the frozen LM blocks so the LLM can condition on visual input, without modifying the pretrained self-attention and FFW weights.
[Diagram: the language features Y pass through a gated cross-attention block (Q=[Y], K=V=[X] from the vision side) with tanh gating, then a gated FFW with tanh gating, and finally the frozen ❄ LM layer (self-attention with Q=K=V=[Y] followed by its FFW).]
def gated_xattn_dense(
    y,            # input language features
    x,            # input visual features
    alpha_xattn,  # xattn gating parameter – init at 0.
    alpha_dense,  # ffw gating parameter – init at 0.
):
    """Applies a GATED XATTN-DENSE layer."""
    # 1. Gated Cross Attention
    y = y + tanh(alpha_xattn) * attention(q=y, kv=x)
    # 2. Gated Feed Forward (dense) Layer
    y = y + tanh(alpha_dense) * ffw(y)
    # Regular self-attention + FFW on language
    y = y + frozen_attention(q=y, kv=y)
    y = y + frozen_ffw(y)
    return y  # output visually informed language features
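Note that alpha_xattn and alpha_dense are initialised at 0, so tanh(0) = 0 and each GATED XATTN-DENSE layer initially acts as the identity around the frozen LM; the gates then open gradually during training, which Figure 6 later visualises. A tiny sketch of this initial behaviour (illustrative only, not the paper's code):

import torch

d = 8
y = torch.randn(1, 4, d)                # language features
alpha_xattn = torch.zeros(1)            # gating parameter, initialised at 0
cross_attn_out = torch.randn(1, 4, d)   # stand-in for attention(q=y, kv=x)

y_new = y + torch.tanh(alpha_xattn) * cross_attn_out
print(torch.allclose(y_new, y))         # True: the gated block is a no-op at init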
Frozen pretrained components
Vision Encoder
• Normalizer-Free ResNet [Brock+, arXiv2021]
• Pretrained with a contrastive image-text objective in the spirit of ALIGN [Jia+, ICML2021] and CLIP [Radford+, ICML2021], then kept frozen
LLM
• Chinchilla [Hoffmann+, arXiv2022], a Transformer LLM
• The 1.4B/7B/70B Chinchilla models are used frozen, giving Flamingo-3B/9B/80B (a sketch of this freezing setup follows the diagram below)
[Architecture diagram repeated from Figure 3: frozen ❄ Vision Encoder and LM blocks; Perceiver Resampler and GATED XATTN-DENSE layers trained from scratch; interleaved "<image> This is a very cute dog. <image> This is" mapped to the output text "a very serious cat."]
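Only the Perceiver Resampler and the GATED XATTN-DENSE layers are updated during Flamingo training; the vision encoder and the Chinchilla LM blocks keep their pretrained weights. A hedged PyTorch-style sketch of that setup (the modules below are placeholders, not the actual components, and the learning rate is arbitrary):

import torch
import torch.nn as nn

def freeze(module):
    """Disable gradients so the module keeps its pretrained weights."""
    for p in module.parameters():
        p.requires_grad = False
    return module

# Placeholder modules standing in for the real components.
vision_encoder = freeze(nn.Linear(1024, 1024))   # pretrained, frozen ❄
lm_blocks = freeze(nn.Linear(1024, 1024))        # pretrained Chinchilla blocks, frozen ❄
perceiver_resampler = nn.Linear(1024, 1024)      # trained from scratch
gated_xattn_dense = nn.Linear(1024, 1024)        # trained from scratch

trainable = (list(perceiver_resampler.parameters())
             + list(gated_xattn_dense.parameters()))
optimizer = torch.optim.AdamW(trainable, lr=1e-4)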
Flamingo training data and objective
• ALIGN: 1.8 billion image-text pairs
• Datasets collected for Flamingo:
• M3W [Rae+, arXiv2021]: interleaved image-text data from 43 million HTML pages
• LTIP: 12 million image-text pairs
• VTP: 27 million video-text pairs
• The model is trained to minimise a weighted sum of per-dataset negative log-likelihoods:
$\sum_{m=1}^{M} \lambda_m \, \mathbb{E}_{(x,y)\sim\mathcal{D}_m}\!\left[ -\sum_{\ell=1}^{L} \log p\!\left( y_\ell \mid y_{<\ell},\, x_{\leq \ell} \right) \right]$
where $y_\ell$ is the $\ell$-th text token, $x_{\leq \ell}$ are the images/videos preceding it, and $\lambda_m$ is the weight of the $m$-th dataset $\mathcal{D}_m$.
Evaluation benchmarks
• VQA: TextVQA [Singh+, CVPR2019], NextQA [Xiao+, CVPR2021]
• Visual Dialog: VisDial [Das+, CVPR2017]
• Vision-and-text classification: HatefulMemes [Kiela+, NeurIPS2020]
• 11 benchmarks in total are held out from design decisions (16 image/video benchmarks overall)
Evaluation protocol (number of shots)
• Flamingo-3B/9B/80B are evaluated zero-shot and few-shot with in-context examples, without any weight updates
• Fine-tuning is also reported for comparison
• During fine-tuning, the Vision Encoder is also trained, at a higher image resolution
[Figure 1 thumbnails repeated: the chinchilla/shiba/flamingo prompt-completion example and the painting VQA examples.]
[Slide shows the paper's table of model architectures; caption fragment: the hidden size of each feed-forward MLP is 4D. L: number of layers, D: transformer hidden size, H: number of heads, Act.: FFW activation, Sq. ReLU: Squared ReLU [104].]
Zero/Few-shot results
• Comparison with prior zero-shot/few-shot SoTA across 16 image (I) and video (V) benchmarks
(The Flamingo rows use no fine-tuning; shots are the number of in-context examples, with the number of shots or the size of the task-specific training set given in parentheses for the SOTA columns.)

Benchmark                | Zero/Few-shot SOTA | Flamingo-3B 0/4/32 shots | Flamingo-9B 0/4/32 shots | Flamingo-80B 0/4/32 shots | Fine-tuned SOTA
OKVQA (I)                | 43.3 (16) [34]     | 41.2 / 43.3 / 45.9       | 44.7 / 49.3 / 51.0       | 50.6 / 57.4 / 57.8        | 54.4 [34] (10K)
VQAv2 (I)                | 38.2 (4) [114]     | 49.2 / 53.2 / 57.1       | 51.8 / 56.3 / 60.4       | 56.3 / 63.1 / 67.6        | 80.2 [140] (444K)
COCO (I)                 | 32.2 (0) [124]     | 73.0 / 85.0 / 99.0       | 79.4 / 93.1 / 106.3      | 84.3 / 103.2 / 113.8      | 143.3 [124] (500K)
MSVDQA (V)               | 35.2 (0) [58]      | 27.5 / 33.0 / 42.6       | 30.2 / 36.2 / 47.2       | 35.6 / 41.7 / 52.3        | 47.9 [28] (27K)
VATEX (V)                | -                  | 40.1 / 50.0 / 59.2       | 39.5 / 51.7 / 57.4       | 46.7 / 56.0 / 65.1        | 76.3 [153] (500K)
VizWiz (I)               | -                  | 28.9 / 34.0 / 45.5       | 28.8 / 34.9 / 44.0       | 31.6 / 39.6 / 49.8        | 57.2 [65] (20K)
Flickr30K (I)            | -                  | 60.6 / 72.0 / 71.2       | 61.5 / 72.6 / 72.8       | 67.2 / 75.1 / 75.4        | 67.4 [150] (30K)
MSRVTTQA (V)             | 19.2 (0) [58]      | 11.0 / 14.9 / 25.6       | 13.7 / 18.2 / 29.4       | 17.4 / 23.9 / 31.0        | 46.8 [51] (130K)
iVQA (V)                 | 12.2 (0) [135]     | 32.7 / 35.7 / 37.7       | 35.2 / 37.7 / 40.7       | 40.7 / 44.1 / 45.3        | 35.4 [135] (6K)
YouCook2 (V)             | -                  | 55.8 / 64.6 / 76.7       | 55.0 / 70.8 / 77.3       | 60.1 / 74.5 / 86.8        | 138.7 [132] (10K)
STAR (V)                 | 39.4 (0) [143]     | 39.6 / 41.3 / 41.6       | 41.8 / 42.8 / 41.2       | 39.7 / 42.4 / 42.2        | 36.7 [128] (46K)
VisDial (I)              | 11.6 (0) [79]      | 46.1 / 47.3 / 47.3       | 48.0 / 50.4 / 50.4       | 52.0 / 55.6 / 55.6        | 75.2 [79] (123K)
TextVQA (I)              | -                  | 30.1 / 32.7 / 30.6       | 31.8 / 33.6 / 32.6       | 35.0 / 36.5 / 37.9        | 54.7 [137] (20K)
NextQA (I)               | -                  | 21.3 / 22.4 / 26.1       | 23.0 / 24.7 / 28.4       | 26.7 / 30.8 / 33.5        | 25.2 [129] (38K)
HatefulMemes (I)         | 66.1 (0) [85]      | 53.7 / 53.6 / 56.3       | 57.0 / 62.7 / 63.5       | 46.4 / 68.6 / 70.0        | 79.1 [62] (9K)
RareAct (V)              | 40.7 (0) [85]      | 58.4 / - / -             | 57.9 / - / -             | 60.8 / - / -              | -
Table 1: Comparison to the state of the art. A single Flamingo model reaches the state of the art
on a wide array of image (I) and video (V) understanding tasks with few-shot learning, significantly
outperforming the previous zero-shot and few-shot state of the art with as few as four examples.
Few-shot Flamingo vs. fine-tuned models
• Few-shot Flamingo surpasses fine-tuned SoTA on 6 of the 16 benchmarks
Figure 2: Flamingo results overview. Left: Our largest model, dubbed Flamingo, outperforms state-of-the-art fine-tuned models on 6 of the 16 tasks we consider, without any fine-tuning.
Summary
• Flamingo: a Visual Language Model scaled up to 80B parameters
• Reuses a frozen pretrained LLM (Chinchilla) and a frozen vision encoder
• Strong zero-shot and few-shot performance on a wide range of image and video benchmarks
Figure 6: Evolution of the absolute value of the tanh gating at different layers of Flamingo-3B. (a) Attention tanh gating; (b) FFW tanh gating.
[Figure 7 diagram: the input webpage "Cute pics of my pets! / My puppy sitting in the grass. / My cat looking very dignified." is turned into the processed text "<BOS>Cute pics of my pets!<EOC><image>My puppy sitting in the grass.<EOC><image>My cat looking very dignified.<EOC>" (with <image> tags and special tokens inserted), then tokenised. Image 1 and Image 2 are each processed by a Vision Encoder and Perceiver Resampler, and the text queries Q cross-attend to K=V=[X] under a mask. The per-token image index row reads: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2.]
Figure 7: Interleaved visual data and text support. Given text interleaved with images/videos,
e.g. coming from a webpage, we first process the text by inserting <image> tags at the locations of
the visual data in the text as well as special tokens (<BOS> for "beginning of sequence" or <EOC> for
"end of chunk"). Images are processed independently by the Vision Encoder and Perceiver Resampler
to extract visual tokens. At a given text token, the model only cross-attends to the visual tokens
corresponding to the last preceding image/video. The image index row indicates which image/video a text token
can attend to, or 0 when no image/video is preceding. In practice, this selective cross-attention is achieved
through masking, illustrated here with the dark blue entries (unmasked/visible) and light blue entries (masked).
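The per-token image indices in Figure 7 (0 for tokens before any image, then 1, 2, ... after each <image>) determine the cross-attention mask: a text token may only attend to the visual tokens of the last preceding image. A small sketch of how such indices and the resulting mask could be computed (my own illustration, not the paper's code):

def image_indices(tokens):
    """For each text token, the index of the last preceding <image> (0 if none)."""
    idx, current = [], 0
    for tok in tokens:
        if tok == "<image>":
            current += 1
        idx.append(current)
    return idx

def xattn_mask(token_image_idx, num_images, tokens_per_image):
    """True where a text token may attend to a visual token (last preceding image only)."""
    mask = []
    for phi in token_image_idx:
        # phi == 0 means no preceding image, so the whole row stays False (masked).
        row = [(img == phi - 1) for img in range(num_images) for _ in range(tokens_per_image)]
        mask.append(row)
    return mask

tokens = "<BOS> Cute pics of my pets ! <EOC> <image> My puppy sitting in the grass . <EOC>".split()
print(image_indices(tokens))
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]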
Fine-tuning
Task (split)             | Flamingo 32 shots | Flamingo fine-tuned | Previous SotA
VQAv2 (test-dev)         | 67.6              | 82.0                | 81.3† [133]
VQAv2 (test-std)         | -                 | 82.1                | 81.3† [133]
COCO (test)              | 113.8             | 138.1               | 149.6† [119]
VATEX (test)             | 65.1              | 84.2                | 81.4† [153]
VizWiz (test-dev)        | 49.8              | 65.7                | 57.2† [65]
VizWiz (test-std)        | -                 | 65.4                | 60.6† [65]
MSRVTTQA (test)          | 31.0              | 47.4                | 46.8 [51]
VisDial (valid)          | 56.8              | 61.8                | 75.2 [79]
VisDial (test-std)       | -                 | 59.7                | 75.4† [123]
YouCook2 (valid)         | 86.8              | 118.6               | 138.7 [132]
TextVQA (valid)          | 36.0              | 57.1                | 54.7 [137]
TextVQA (test-std)       | -                 | 54.1                | 73.7 [84]
HatefulMemes (test seen) | 70.0              | 86.6                | 84.6† [152]
Table 2: Comparison to SotA when fine-tuning Flamingo. We fine-tune Flamingo on all nine
tasks where Flamingo does not achieve SotA with few-shot learning. Flamingo sets a new SotA on
five of them, outperforming methods (marked with †) that use tricks such as model ensembling or
domain-specific metric optimisation (e.g., CIDEr optimisation).
Ablation study on Flamingo-3B:
Ablated setting   | Original value | Changed value            | Param. count | Step time | COCO CIDEr | OKVQA top1 | VQAv2 top1 | MSVDQA top1 | VATEX CIDEr | Overall score
Baseline          | -              | -                        | 3.2B         | 1.74s     | 86.5       | 42.1       | 55.8       | 36.3        | 53.4        | 70.7
(i) Training data | All data       | w/o Video-Text pairs     | 3.2B         | 1.42s     | 84.2       | 43.0       | 53.9       | 34.5        | 46.0        | 67.3
(i) Training data | All data       | w/o Image-Text pairs     | 3.2B         | 0.95s     | 66.3       | 39.2       | 51.6       | 32.0        | 41.6        | 60.9
(i) Training data | All data       | Image-Text pairs → LAION | 3.2B         | 1.74s     | 79.5       | 41.4       | 53.5       | 33.9        | 47.6        | 66.4
(i) Training data | All data       | w/o M3W                  | 3.2B         | 1.02s     | 54.1       | 36.5       | 52.7       | 31.4        | 23.5        | 53.4
(ii) Optimisation | Accumulation   | Round Robin              | 3.2B         | 1.68s     | 76.1       | 39.8       | 52.1       | 33.2        | 40.8        | 62.9
(iii) Tanh gating | ✓              | ✗                        | 3.2B         | 1.74s     | 78.4       | 40.5       | 52.9       | 35.9        | 47.5        | 66.5