ProbVLM: Probabilistic Adapter for
Frozen Vision-Language Models
Uddeshya Upadhyay, Shyamgopal Karthik, Massimiliano Mancini, Zeynep Akata
ICCV 2023
Jun Kimata (Nagoya Institute of Technology)
2023/10/30
Overview
■ Adds probabilistic embeddings on top of deterministically pretrained Vision-Language models
• Estimates the uncertainty of multimodal embeddings
• Supports downstream tasks
• Visualizes the embeddings with a latent diffusion model
[Paper title page: ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models. Affiliations: University of Tübingen, University of Trento, MPI for Intelligent Systems.]
Goal of the Method
■ Vision-Language models such as CLIP [Radford+, ICML2021]
• Embed images and text into a shared embedding space
■ Proposed method
• Re-embeds the frozen embeddings probabilistically via an adapter
• Quantifies the uncertainty of the outputs
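The shared embedding space mentioned above supports retrieval by comparing an image embedding against text embeddings. The sketch below is a generic CLIP-style cosine-similarity ranking, not ProbVLM's probabilistic variant; all names are illustrative.

```python
import numpy as np

def retrieve(image_emb, text_embs):
    """Rank candidate texts for one image by cosine similarity
    in a shared embedding space (generic, deterministic sketch)."""
    v = image_emb / np.linalg.norm(image_emb)
    T = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = T @ v                     # cosine similarity per text
    return np.argsort(-sims)         # best match first
```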
Method
Figure 2: Proposed framework (ProbVLM) takes an existing vision-language model and introduces a probabilistic adapter
Method
■ Intra-modal alignment
• Keeps the ProbVLM outputs from straying too far from the original embeddings
■ Cross-modal alignment
• Brings the ProbVLM outputs for text and image close to each other
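The two alignment terms can be illustrated with a toy computation. This is a hypothetical simplification using plain L2 penalties on the adapter's predicted means; ProbVLM's actual objective is a likelihood over a parameterized distribution, which is not reproduced here.

```python
import numpy as np

def alignment_losses(img_mu, txt_mu, img_emb, txt_emb):
    """Toy intra/cross-modal alignment terms (illustrative only).
    img_mu/txt_mu: adapter-predicted means; img_emb/txt_emb: frozen embeddings."""
    # Intra-modal: adapter output should stay close to the frozen embedding
    intra = np.mean((img_mu - img_emb) ** 2) + np.mean((txt_mu - txt_emb) ** 2)
    # Cross-modal: image and text outputs should be close to each other
    cross = np.mean((img_mu - txt_mu) ** 2)
    return intra, cross
```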
Figure 2: Proposed framework (ProbVLM) takes an existing vision-language model and introduces a probabilistic adapter over the image and text encoders. These adapters predict the parameters of a parameterized distribution for a given embedding. Models are trained by minimizing an objective consisting of intra/cross-modal supervision.

Excerpt from the paper: encoders such as CLIP, SLIP [51], Flava [74], and BLIP [45] are in a frozen state, i.e., we have V(·; θ*_V) and T(·; θ*_T), where θ*_V, θ*_T represent the parameters of the pretrained frozen encoders. These encoders are deterministic and map an image/text to vectors in the shared space. ProbVLM learns to estimate the parameters {ẑ, ν̂ ... ρ̂} with the help of the frozen encoders V(·; θ*_V) and T(·; θ*_T). The adapter functions Ψ_V(·; ζ_V) and Ψ_T(·; ζ_T) operate on image and text embeddings respectively, but during training depend on both modalities.
Experiments
■ Datasets
• MS-COCO [Lin+, ECCV2014]
• Flickr-30k [Plummer+, ICCV2015]
• CUB [Wah+, Caltech 2011]
• Oxford-Flowers 102 [Nilsback+, ICVGIP2008]
■ Evaluation metrics
• Recall@k
• Uncertainty level, as defined in [Chun+, CVPR2021]
■ Baselines
• Adapted from PFE [Shi and Jain, ICCV2019]
■ Learning rate
• 1e-4
■ Training epochs
• 100
Measuring Calibrated Uncertainty
■ Performance should degrade as uncertainty increases
• S: Spearman rank correlation between R@k and the uncertainty level
• R²: regression fit between the uncertainty level and R@1
• -SR²: the product of the two
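The three calibration scores can be sketched as follows. This is a minimal reading of the definitions (Spearman correlation, linear-fit R², and their product); the paper may bin recall by uncertainty level before computing them, so treat this as an illustrative approximation.

```python
import numpy as np

def _ranks(x):
    # Rank values 0..n-1 (assumes no ties, for simplicity)
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(len(x))
    return ranks

def calibration_scores(uncertainty_levels, recall):
    """S, R^2 and -SR^2 for uncertainty calibration (illustrative sketch)."""
    u = np.asarray(uncertainty_levels, dtype=float)
    r = np.asarray(recall, dtype=float)
    # S: Spearman rank correlation = Pearson correlation of the ranks
    s = np.corrcoef(_ranks(u), _ranks(r))[0, 1]
    # R^2: goodness of a linear fit of recall on uncertainty level
    slope, intercept = np.polyfit(u, r, 1)
    pred = slope * u + intercept
    r2 = 1.0 - np.sum((r - pred) ** 2) / np.sum((r - np.mean(r)) ** 2)
    # -SR^2: combined score; higher is better for a calibrated model
    return s, r2, -s * r2
```

For a well-calibrated model, recall falls monotonically as uncertainty rises, so S approaches -1, R² approaches 1, and -SR² approaches 1.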
                           --------------- i2t ---------------  --------------- t2i ---------------
VLM   Method      Metric   COCO    Flickr  FLO     CUB      COCO    Flickr  FLO     CUB
CLIP  ProbVLM     S ↓      -0.99   -0.70   -0.90   -0.60    -0.30   -0.70   -0.99   -0.89
                  R² ↑      0.93    0.71    0.62    0.67     0.35    0.50    0.99    0.70
                  -SR² ↑    0.93    0.49    0.56    0.40     0.10    0.35    0.99    0.63
      PFE*[73]    S ↓      -0.79   -0.19    0.60   -0.60     0.79    0.30   -0.89   -0.10
                  R² ↑      0.59    0.01    0.30    0.28     0.74    0.44    0.52    0.00
                  -SR² ↑    0.47    0.00   -0.18    0.17    -0.59   -0.13    0.47   -0.00
      PCME*[10]   S ↓      -0.89   -0.30   -0.30   -0.60     0.30    0.09   -0.70    0.30
                  R² ↑      0.75    0.07    0.07    0.20     0.16    0.01    0.57    0.01
                  -SR² ↑    0.68    0.02    0.02    0.12    -0.05   -0.00    0.40   -0.00
      TTDA[2]     S ↓      -0.79   -0.30    0.00   -0.60    -0.10   -0.19   -0.89   -0.50
                  R² ↑      0.69    0.09    0.00    0.41     0.26    0.071   0.80    0.15
                  -SR² ↑    0.55    0.03    0.00    0.24     0.00    0.01    0.73    0.07
BLIP  ProbVLM     S ↓      -0.87   -0.79   -0.74   -0.66    -0.43   -0.38   -0.31   -0.22
                  R² ↑      0.92    0.83    0.68    0.61     0.52    0.48    0.45    0.38
                  -SR² ↑    0.80    0.66    0.50    0.40     0.22    0.18    0.14    0.08
      PFE*[73]    S ↓      -0.82   -0.74   -0.63   -0.63    -0.39   -0.32   -0.28   -0.18
                  R² ↑      0.72    0.76    0.62    0.44     0.48    0.38    0.39    0.37
                  -SR² ↑    0.58    0.57    0.39    0.27     0.19    0.12    0.11    0.07
      PCME*[10]   S ↓      -0.76   -0.53   -0.60   -0.44    -0.28   -0.26   -0.28   -0.21
                  R² ↑      0.81    0.56    0.60    0.53     0.50    0.34    0.44    0.36
                  -SR² ↑    0.62    0.29    0.36    0.23     0.14    0.09    0.12    0.08
      TTDA[2]     S ↓      -0.44   -0.33   -0.74   -0.60    -0.19   -0.26   -0.21   -0.21
                  R² ↑      0.66    0.56    0.42    0.55     0.49    0.23    0.35    0.36
                  -SR² ↑    0.29    0.18    0.31    0.33     0.10    0.06    0.07    0.08

Table 1: Metrics to evaluate the calibration of the uncertainty estimates.
Visualizing Ambiguity
■ Predicted embedding distribution for a fixed CUB image
• Compute the likelihood of every sample from CUB and COCO
■ Comparison of ProbVLM (left) and CLIP (right)
• The overlapping regions show that ambiguity is captured
Visualizing Embeddings with Stable Diffusion
■ Method
• Sample embedding vectors from the caption's predicted distribution
• Pass them through a Stable Diffusion model to visualize
■ Results
• Samples close to the mean yield meaningful variation in the generated images
• Samples too far from the mean produce strong artifacts
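The sampling step above can be sketched as drawing embeddings at increasing distance from the predicted mean. This uses a Gaussian-style stand-in (mean plus a scaled unit direction) rather than ProbVLM's actual predicted distribution, and the diffusion-model decoding of each sample is omitted.

```python
import numpy as np

def sample_around_mean(mu, sigma, scales, seed=0):
    """Embeddings at increasing distance from the predicted mean mu
    (hypothetical stand-in; decode each with a diffusion model to visualize)."""
    rng = np.random.default_rng(seed)
    direction = rng.standard_normal(mu.shape)
    direction /= np.linalg.norm(direction)   # unit direction in embedding space
    return [mu + s * sigma * direction for s in scales]
```

Decoding the returned list in order would reproduce the slide's observation: early samples vary meaningfully, late ones drift into artifacts.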
Summary
■ Proposal of ProbVLM
• Estimates embedding distributions for frozen, large-scale deterministic VLMs
• Demonstrates usefulness for downstream tasks
• Experiments interpreting the predicted embedding distributions with a diffusion model
