ProbVLM: Probabilistic Adapter for
Frozen Vision-Language Models
Uddeshya Upadhyay, Shyamgopal Karthik, Massimiliano Mancini, Zeynep Akata
ICCV 2023
Jun Kimata (Nagoya Institute of Technology)
2023/10/30
Overview
■ Adds probabilistic embeddings on top of deterministically pretrained Vision-Language models
• Estimates the uncertainty of multimodal embeddings
• Supports downstream tasks
• Visualizes the embeddings with a latent diffusion model
[Paper title page: ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models. Affiliations: University of Tübingen, University of Trento, MPI for Intelligent Systems.]
Goal of the Method
■ Vision-Language models such as CLIP [Radford+, ICML2021]
• Embed images and text into a shared embedding space
■ Proposed method
• Re-embeds the frozen embeddings probabilistically via an adapter
• Quantifies the uncertainty of the outputs
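The shared embedding space mentioned above supports retrieval by comparing an image embedding against text embeddings. The sketch below is a generic CLIP-style cosine-similarity ranking, not ProbVLM's probabilistic variant; all names are illustrative.

```python
import numpy as np

def retrieve(image_emb, text_embs):
    """Rank candidate texts for one image by cosine similarity
    in a shared embedding space (generic, deterministic sketch)."""
    v = image_emb / np.linalg.norm(image_emb)
    T = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = T @ v                     # cosine similarity per text
    return np.argsort(-sims)         # best match first
```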
Method
Figure 2: Proposed framework (ProbVLM) takes an existing vision-language model and introduces a probabilistic adapter
Method
■ Intra-modal alignment
• Keeps the ProbVLM outputs from straying too far from the original embeddings
■ Cross-modal alignment
• Brings the ProbVLM outputs for text and image close to each other
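The two alignment terms can be illustrated with a toy computation. This is a hypothetical simplification using plain L2 penalties on the adapter's predicted means; ProbVLM's actual objective is a likelihood over a parameterized distribution, which is not reproduced here.

```python
import numpy as np

def alignment_losses(img_mu, txt_mu, img_emb, txt_emb):
    """Toy intra/cross-modal alignment terms (illustrative only).
    img_mu/txt_mu: adapter-predicted means; img_emb/txt_emb: frozen embeddings."""
    # Intra-modal: adapter output should stay close to the frozen embedding
    intra = np.mean((img_mu - img_emb) ** 2) + np.mean((txt_mu - txt_emb) ** 2)
    # Cross-modal: image and text outputs should be close to each other
    cross = np.mean((img_mu - txt_mu) ** 2)
    return intra, cross
```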
Figure 2: Proposed framework (ProbVLM) takes an existing vision-language model and introduces a probabilistic adapter over the image and text encoders. These adapters predict the parameters of a parameterized distribution for a given embedding. Models are trained by minimizing an objective consisting of intra/cross-modal supervision.

Excerpt from the paper: encoders such as CLIP, SLIP [51], Flava [74], and BLIP [45] are in a frozen state, i.e., we have V(·; θ*_V) and T(·; θ*_T), where θ*_V, θ*_T represent the parameters of the pretrained frozen encoders. These encoders are deterministic and map an image/text to vectors in the shared space. ProbVLM learns to estimate the parameters {ẑ, ν̂ ... ρ̂} with the help of the frozen encoders V(·; θ*_V) and T(·; θ*_T). The adapter functions Ψ_V(·; ζ_V) and Ψ_T(·; ζ_T) operate on image and text embeddings respectively, but during training depend on both modalities.
Experiments
■ Datasets
• MS-COCO [Lin+, ECCV2014]
• Flickr-30k [Plummer+, ICCV2015]
• CUB [Wah+, Caltech 2011]
• Oxford-Flowers 102 [Nilsback+, ICVGIP2008]
■ Evaluation metrics
• Recall@k
• Uncertainty level, as defined in [Chun+, CVPR2021]
■ Baselines
• Adapted from PFE [Shi and Jain, ICCV2019]
■ Learning rate
• 1e-4
■ Training epochs
• 100
Measuring Calibrated Uncertainty
■ Performance should degrade as uncertainty increases
• S: Spearman rank correlation between R@k and the uncertainty level
• R²: regression fit between the uncertainty level and R@1
• -SR²: the product of the two
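The three calibration scores can be sketched as follows. This is a minimal reading of the definitions (Spearman correlation, linear-fit R², and their product); the paper may bin recall by uncertainty level before computing them, so treat this as an illustrative approximation.

```python
import numpy as np

def _ranks(x):
    # Rank values 0..n-1 (assumes no ties, for simplicity)
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(len(x))
    return ranks

def calibration_scores(uncertainty_levels, recall):
    """S, R^2 and -SR^2 for uncertainty calibration (illustrative sketch)."""
    u = np.asarray(uncertainty_levels, dtype=float)
    r = np.asarray(recall, dtype=float)
    # S: Spearman rank correlation = Pearson correlation of the ranks
    s = np.corrcoef(_ranks(u), _ranks(r))[0, 1]
    # R^2: goodness of a linear fit of recall on uncertainty level
    slope, intercept = np.polyfit(u, r, 1)
    pred = slope * u + intercept
    r2 = 1.0 - np.sum((r - pred) ** 2) / np.sum((r - np.mean(r)) ** 2)
    # -SR^2: combined score; higher is better for a calibrated model
    return s, r2, -s * r2
```

For a well-calibrated model, recall falls monotonically as uncertainty rises, so S approaches -1, R² approaches 1, and -SR² approaches 1.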
                           --------------- i2t ---------------  --------------- t2i ---------------
VLM   Method      Metric   COCO    Flickr  FLO     CUB      COCO    Flickr  FLO     CUB
CLIP  ProbVLM     S ↓      -0.99   -0.70   -0.90   -0.60    -0.30   -0.70   -0.99   -0.89
                  R² ↑      0.93    0.71    0.62    0.67     0.35    0.50    0.99    0.70
                  -SR² ↑    0.93    0.49    0.56    0.40     0.10    0.35    0.99    0.63
      PFE*[73]    S ↓      -0.79   -0.19    0.60   -0.60     0.79    0.30   -0.89   -0.10
                  R² ↑      0.59    0.01    0.30    0.28     0.74    0.44    0.52    0.00
                  -SR² ↑    0.47    0.00   -0.18    0.17    -0.59   -0.13    0.47   -0.00
      PCME*[10]   S ↓      -0.89   -0.30   -0.30   -0.60     0.30    0.09   -0.70    0.30
                  R² ↑      0.75    0.07    0.07    0.20     0.16    0.01    0.57    0.01
                  -SR² ↑    0.68    0.02    0.02    0.12    -0.05   -0.00    0.40   -0.00
      TTDA[2]     S ↓      -0.79   -0.30    0.00   -0.60    -0.10   -0.19   -0.89   -0.50
                  R² ↑      0.69    0.09    0.00    0.41     0.26    0.071   0.80    0.15
                  -SR² ↑    0.55    0.03    0.00    0.24     0.00    0.01    0.73    0.07
BLIP  ProbVLM     S ↓      -0.87   -0.79   -0.74   -0.66    -0.43   -0.38   -0.31   -0.22
                  R² ↑      0.92    0.83    0.68    0.61     0.52    0.48    0.45    0.38
                  -SR² ↑    0.80    0.66    0.50    0.40     0.22    0.18    0.14    0.08
      PFE*[73]    S ↓      -0.82   -0.74   -0.63   -0.63    -0.39   -0.32   -0.28   -0.18
                  R² ↑      0.72    0.76    0.62    0.44     0.48    0.38    0.39    0.37
                  -SR² ↑    0.58    0.57    0.39    0.27     0.19    0.12    0.11    0.07
      PCME*[10]   S ↓      -0.76   -0.53   -0.60   -0.44    -0.28   -0.26   -0.28   -0.21
                  R² ↑      0.81    0.56    0.60    0.53     0.50    0.34    0.44    0.36
                  -SR² ↑    0.62    0.29    0.36    0.23     0.14    0.09    0.12    0.08
      TTDA[2]     S ↓      -0.44   -0.33   -0.74   -0.60    -0.19   -0.26   -0.21   -0.21
                  R² ↑      0.66    0.56    0.42    0.55     0.49    0.23    0.35    0.36
                  -SR² ↑    0.29    0.18    0.31    0.33     0.10    0.06    0.07    0.08

Table 1: Metrics to evaluate the calibration of the uncertainty estimates.
Visualizing Ambiguity
■ Predicted embedding distribution for a fixed CUB image
• Compute the likelihood of every sample from CUB and COCO
■ Comparison of ProbVLM (left) and CLIP (right)
• The overlapping regions show that ambiguity is captured
Visualizing Embeddings with Stable Diffusion
■ Method
• Sample embedding vectors from the caption's predicted distribution
• Pass them through a Stable Diffusion model to visualize
■ Results
• Samples close to the mean yield meaningful variation in the generated images
• Samples too far from the mean produce strong artifacts
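The sampling step above can be sketched as drawing embeddings at increasing distance from the predicted mean. This uses a Gaussian-style stand-in (mean plus a scaled unit direction) rather than ProbVLM's actual predicted distribution, and the diffusion-model decoding of each sample is omitted.

```python
import numpy as np

def sample_around_mean(mu, sigma, scales, seed=0):
    """Embeddings at increasing distance from the predicted mean mu
    (hypothetical stand-in; decode each with a diffusion model to visualize)."""
    rng = np.random.default_rng(seed)
    direction = rng.standard_normal(mu.shape)
    direction /= np.linalg.norm(direction)   # unit direction in embedding space
    return [mu + s * sigma * direction for s in scales]
```

Decoding the returned list in order would reproduce the slide's observation: early samples vary meaningfully, late ones drift into artifacts.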
Summary
■ Proposal of ProbVLM
• Estimates embedding distributions for frozen, large-scale deterministic VLMs
• Demonstrates usefulness for downstream tasks
• Experiments interpreting the predicted embedding distributions with a diffusion model
