MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정

MCSE: Multimodal Contrastive Learning of Sentence Embeddings
Miaoran Zhang, Marius Mosbach, David Ifeoluwa Adelani, Michael A. Hedderich, and Dietrich Klakow, 2022

01 Introduction
02 Related Work
03 Experiments
04 Conclusions

Introduction
• MCSE : Multimodal Contrastive Learning of Sentence Embeddings
 ✓ Background : Unsupervised SimCSE (Gao et al., 2021)
 ✓ Extends SimCSE with a multimodal contrastive objective
 ✓ Evaluated on standard Semantic Textual Similarity (STS) tasks

Introduction
• Architecture of MCSE
 ✓ f^v(·) is a pre-trained image encoder such as ResNet

Related Work
• Contrastive learning : background Unsupervised SimCSE
 ✓ Data augmentation strategy : dropout noise
 ✓ Pulls positive sentences closer and pushes negatives apart
 ✓ Similarity measure : cosine similarity
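
The SimCSE-style objective summarized above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `anchors` and `positives` stand for the two dropout views of the same batch of sentences, the other sentences in the batch act as negatives, and the temperature `tau=0.05` is an assumed value. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, tau=0.05):
    """In-batch contrastive loss: positives[i] is the positive for anchors[i],
    every other positives[j] is treated as a negative (assumed setup)."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / tau for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

The loss is small when each anchor is most similar to its own positive and grows when a negative is more similar than the positive.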

Related Work
• Multimodal Contrastive Learning
 ✓ Sentence-image pairs : sentence x_i and image y_i
 ✓ f^v(·) : pre-trained image encoder such as ResNet
 ✓ f_θ(·) : pre-trained language encoder such as BERT
 ✓ Pulls semantically close image-sentence pairs together and pushes unrelated pairs apart
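
One way to write down an objective of this kind is shown below. It is a sketch under assumptions rather than a quotation of the paper: s_i^z denotes the sentence embedding f_θ(x_i) under dropout mask z after projection into a shared space, v_i the projected image embedding f^v(y_i), sim the cosine similarity, τ' a temperature, λ a trade-off weight, and ℓ_i^S the text-only SimCSE loss for sentence i. All of these symbols are introduced here for illustration.

```latex
% Multimodal term: each dropout view of sentence i should match its own image
% v_i rather than the images of the other N-1 pairs in the batch.
\ell_i^{M} = - \sum_{z \in \{z_i,\, z_i'\}}
  \log \frac{\exp\big(\mathrm{sim}(s_i^{z}, v_i)/\tau'\big)}
            {\sum_{j=1}^{N} \exp\big(\mathrm{sim}(s_i^{z}, v_j)/\tau'\big)}

% Combined objective: text-only term plus weighted multimodal term.
\mathcal{L} = \sum_{i=1}^{N} \big( \ell_i^{S} + \lambda\, \ell_i^{M} \big)
```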

Experiments
• Dataset
 ✓ Multimodal datasets : Flickr30k (29,783 images) and MS-COCO (82,783 images)
 ✓ Text corpus : Wiki1M (English Wikipedia : 10^6 sentences)
• Encoder
 ✓ Language encoders : BERT and RoBERTa
 ✓ Image encoder : ResNet-50
 ✓ Projection heads : single-layer MLPs
• Evaluation
 ✓ 7 Semantic Textual Similarity (STS) tasks : STS 2012-2016, STS Benchmark, SICK-Relatedness
 ✓ Metric : Spearman's correlation
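
The evaluation protocol can be illustrated as follows: score each sentence pair by the cosine similarity of its two embeddings, then report Spearman's rank correlation against the human similarity scores. This is a self-contained sketch with hypothetical helper names (`sts_score`, `average_ranks`); real evaluations usually rely on an existing toolkit rather than hand-written code.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(predicted, gold):
    """Spearman's correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(gold))

def sts_score(embedding_pairs, gold_scores):
    """embedding_pairs: list of (u, v) sentence embeddings; gold_scores: human ratings."""
    predicted = [cosine(u, v) for u, v in embedding_pairs]
    return spearman(predicted, gold_scores)
```

Because only ranks matter, the metric is insensitive to the scale of the cosine scores, which is why it suits comparing encoders with differently shaped similarity distributions.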

Results
• MCSE trained on Wiki1M + Flickr30k improves the average STS score
BERT (76.3 → 77.3)
RoBERTa (76.6 → 78.3)
• STS16 with MCSE-BERT : result attributed to the domain discrepancy
Performance comparison on STS tasks

Results
• Performance decreases without the large text-only corpus
• MCSE models still outperform SimCSE (0.9 – 3.8 points improvement)
• Replacing the paired images with shuffled images lowers Spearman's correlation (0.8 – 5.0 points reduction)
-> validating the efficacy of visual semantics
Average Spearman's correlation on 7 STS tasks

Results
• Alignment-Uniformity
 ✓ Alignment : distance between paired instances (shorter is better)
Similar samples have similar features
 ✓ Uniformity : how uniformly the embeddings are distributed (more uniform is better)
Preserve maximal information
* Reference : Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere (ICML 2020)
• It is important that the embedding space is wide and evenly distributed, so that each word preserves its own meaning.
• Contrastive learning forces negative pairs away from positive pairs, and in doing so spreads the embeddings uniformly over the space.
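
Both properties can be estimated directly from a set of embeddings. The sketch below follows the commonly used definitions from the referenced ICML 2020 paper (squared distance for alignment, a Gaussian potential with t = 2 for uniformity); the function names are hypothetical and the embeddings are L2-normalized inside the helpers.

```python
import math
from itertools import combinations

def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def alignment(positive_pairs):
    """Mean squared distance between normalized positive pairs (lower is better)."""
    distances = [squared_distance(normalize(u), normalize(v)) for u, v in positive_pairs]
    return sum(distances) / len(distances)

def uniformity(embeddings, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs (lower is better)."""
    points = [normalize(e) for e in embeddings]
    potentials = [math.exp(-t * squared_distance(u, v)) for u, v in combinations(points, 2)]
    return math.log(sum(potentials) / len(potentials))
```

Embeddings that collapse into a small region score close to 0 on uniformity, while embeddings spread over the hypersphere score clearly below 0.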

Results
• Alignment-Uniformity
 ✓ p_pos : distribution of positive pairs
 ✓ p_data : data distribution
MCSE models : visual grounding enhances the embeddings by improving the alignment property
The alignment-uniformity plot of models (BERT)
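
For reference, the two quantities on the plot axes are usually defined as below, following the ICML 2020 paper cited earlier in these slides. Here f is the (normalized) sentence encoder; the exponent 2 and the factor 2 are the commonly used defaults and are stated as assumptions.

```latex
% Alignment: expected squared distance between positive pairs (lower is better).
\ell_{\text{align}} = \mathbb{E}_{(x,\, x^{+}) \sim p_{\text{pos}}}
  \big\| f(x) - f(x^{+}) \big\|^{2}

% Uniformity: log of the expected Gaussian potential between random samples (lower is better).
\ell_{\text{uniform}} = \log \; \mathbb{E}_{x,\, y \,\overset{\text{i.i.d.}}{\sim}\, p_{\text{data}}}
  \; e^{-2 \| f(x) - f(y) \|^{2}}
```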

Results
• Improvements on Different Subsets
 ✓ Subsets benefit from visual grounding to different degrees because of domain discrepancy

Results
• SimCSE retrieves sentences with similar syntax, whereas MCSE retrieves sentences that are syntactically diverse but share the same semantics

Results
• Cross-Modal Retrieval : metric Recall@K
 ✓ Recall@K : recall over the top-K retrieved results
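
A minimal sketch of Recall@K for retrieval is given below. It assumes a simplified setup in which query i has exactly one correct candidate, located at index i of its similarity row; real caption-image benchmarks often have several correct matches per query. The function name `recall_at_k` is hypothetical.

```python
def recall_at_k(similarity, k):
    """similarity[i][j]: score between query i and candidate j.
    Returns the fraction of queries whose true match (index i) is in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy example: 3 sentence queries against 3 image candidates.
scores = [
    [0.9, 0.1, 0.0],  # true match ranked 1st
    [0.8, 0.5, 0.1],  # true match ranked 2nd
    [0.2, 0.3, 0.1],  # true match ranked 3rd
]
recalls = {k: recall_at_k(scores, k) for k in (1, 2, 3)}
```

Recall@K never decreases as K grows, so results are typically reported at several cutoffs.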


Conclusion
• Proposes MCSE for sentence embedding learning
• MCSE consistently improves performance on STS tasks
• The superiority of the method is supported by analyzing the alignment and uniformity properties of the embedding space.
• SimCSE performs better than MCSE on limited samples, while MCSE outperforms SimCSE on larger data.
-> related to multimodal weight training
