Speaker: Nicolò Rinaldi
INFORMATION RETRIEVAL MEETUP 2025 - London, United Kingdom
Exploring Multilingual Embeddings for
Italian Semantic Search
1
A Pretrained and Fine-tuned Approach
Who am I?
NICOLÒ RINALDI
SOFTWARE ENGINEER/DATA SCIENTIST @ Sease
● Born in Mirandola
● Master's Degree in Data Science at the University
of Padua
● Passionate about semantic search, NLP, machine
learning, information retrieval and related
technologies
● Tennis player and mountain lover
2
2
Overview
1
2
3
4
5
Introduction and Motivation
Datasets
Model Structure and Fine-tuning
Evaluation
Results and Conclusions
3
3
GOAL
5
Develop a system capable of answering customer questions in natural
language, for the Italian language
BENEFITS
6
● Ease of interaction
● Reduction of information overload
● Accessible to a broader range of people
7
7
HOW?
RAG: Retrieval-Augmented Generation
8
Search component
9
Dense vs lexical retrieval
Dense
Uses neural embeddings to represent queries
and documents
Computes semantic similarity in vector space
Handles synonyms and paraphrases better,
enabling semantic search.
Lexical
Based on exact word matching
Retrieves documents that share keywords with
the query
Simple and fast, but struggles with synonyms or
paraphrases.
10
Symmetric vs Asymmetric semantic search
Asymmetric
Short queries and longer documents, where we
hope to find information answering the input
query
Symmetric
Queries and the documents in the corpus are
about the same length
11
Model to beat: text-embedding-3-small (OpenAI)
SPECS
Embedding dimension: 1536
Cost: $0.02 per 1M tokens
CONS
Embedding dimension is large -> slow search
One API call is needed for each query at inference
time -> slow inference
The costs keep increasing
12
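For context, a minimal sketch of the per-query embedding API call this model requires at inference time (illustrative only; it assumes the openai Python client and an API key are available):

# Sketch of the per-query embedding API call the OpenAI model requires.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="quale frutto è originario dell'australia",
)
query_embedding = response.data[0].embedding  # 1536-dimensional vector
# Every query at inference time pays this network round trip and the token cost.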
Roadmap
13
13
Find Models
Find Datasets
Fine-tune the models
Overview
1
2
3
4
5
Introduction
Datasets
Model Structure and Fine-tuning
Evaluation
Results and Conclusions
14
14
Train dataset: mMARCO
15
15
14 different languages, 39M samples
query quale frutto è originario dell'australia
pos
Passiflora herbertiana. Un raro frutto della passione originario dell'Australia. I frutti sono a buccia verde, a polpa
bianca, con una valutazione commestibile sconosciuta. Alcune fonti elencano il frutto come commestibile, dolce
e gustoso, mentre altre elencano i frutti come amari e non commestibili.
neg
La noce di cola è il frutto dell'albero di cola, un genere (Cola) di alberi originari delle foreste pluviali tropicali
dell'Africa.
Validation and Test set: Dbpedia-Entity-v2
16
16
4.6M documents, 400 test queries
and 67 validation queries
qrels:
● 0 : non-relevant
● 1 : relevant
● 2 : doc is the answer to the
query
Validation and Test set: Dbpedia-Entity-v2
19
19
_id: <dbpedia:Afghan_cuisine>
title: Afghan cuisine
text: La cucina afghana si basa in gran parte sulle colture
principali della nazione, come grano, mais, orzo e riso.
Accompagnando queste materie prime sono frutta e verdura
autoctoni, così come prodotti lattiero-caseari come latte,
yogurt e siero di latte. Kabuli Palaw è il piatto nazionale
dell'Afghanistan. Le specialità culinarie della nazione
riflettono la sua diversità etnica e geografica. L'Afghanistan è
noto per i suoi melograni di alta qualità, uva e meloni a
forma di calcio dolce.
_id: INEX_LD-2009022
text: Szechwan piatto cucina alimentare
query-id: INEX_LD-2009022
corpus-id: <dbpedia:Afghan_cuisine>
score: 0
Overview
1
2
3
4
5
Introduction and Motivation
Datasets
Model Structure and Fine-tuning
Evaluation
Results and Conclusions
20
20
21
21
Models
Baselines (count based):
● TF-IDF
● BM25
Neural network based (from MTEB, the Massive
Text Embedding Benchmark):
● LaBSE
● multilingual-e5-large
● bge-m3
22
22
TF-IDF: Term Frequency-Inverse Document Frequency
Term: t
Document: d
No semantic representation, just
keyword matching
TF and IDF are stored alongside the
inverted index during index
construction. This makes retrieval
very fast.
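For reference, a standard formulation of the TF-IDF weight (variants exist; this is general background rather than the exact formula shown on the slide):

\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}

where tf(t, d) is the frequency of term t in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t.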
23
23
BM25: Best Match 25
TF(qi, d) is the frequency of term qi in the document d.
|d| is the length of the document d.
|davg| is the average document length over all the documents in the
corpus.
IDF(qi) denotes the Inverse Document Frequency for term qi.
Params: k1 is typically set between 1.2 and 2.0, and b is often set around 0.75
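For reference, the standard BM25 scoring function these symbols belong to:

\mathrm{score}(d, q) = \sum_{q_i \in q} \mathrm{IDF}(q_i) \cdot
\frac{\mathrm{TF}(q_i, d)\,(k_1 + 1)}
{\mathrm{TF}(q_i, d) + k_1\left(1 - b + b\,\frac{|d|}{|d_{avg}|}\right)}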
24
24
BERT: Bidirectional Encoder Representations from Transformers
25
25
Sentence-BERT: Sentence embeddings using siamese BERT-networks
Various options for the similarity function:
- Cosine similarity
- Dot product
Various pooling methods:
- [CLS] token embedding
- Mean of all the input token embeddings
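To make the bi-encoder setup concrete, here is a minimal sketch with the sentence-transformers library (LaBSE is used only because it appears later in the deck; pooling and similarity follow the model's own configuration, and this is not the talk's actual pipeline):

# Minimal bi-encoder sketch: encode query and documents, compare with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

query = "quale frutto è originario dell'australia"
docs = [
    "Passiflora herbertiana. Un raro frutto della passione originario dell'Australia.",
    "La noce di cola è il frutto dell'albero di cola, originario dell'Africa.",
]

# Encode query and documents into dense vectors (pooling is defined by the model).
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)

# Cosine similarity in vector space: higher means semantically closer.
scores = util.cos_sim(query_emb, doc_embs)
print(scores)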
26
26
LaBSE: Language-Agnostic BERT Sentence Embedding
From Google AI
109 languages supported
- window: 256 tokens
- parameters: 471M
- emb dim: 768
- pooling: [CLS] token
- pre-trained model: BERT base (12
transformer encoder layers)
27
27
Multilingual E5 Text Embeddings: large version
From Microsoft Corporation
Tested on over 100 languages
- window: 512 tokens, but needs prefix “query: “
- parameters: 559M
- emb dim: 1024
- pooling: Mean of all the input tokens
- pre-trained model: XLM-RoBERTa large (24
transformer encoder layers)
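The "query: " prefix matters in practice. A minimal sketch of how inputs are typically prefixed for multilingual-e5-large (illustrative; the "passage: " prefix for documents follows the model card rather than this deck):

# E5-style prefixing sketch with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

query_emb = model.encode("query: quale frutto è originario dell'australia")
doc_emb = model.encode("passage: Passiflora herbertiana è un frutto originario dell'Australia.")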
28
28
BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity
Text Embeddings Through Self-Knowledge Distillation
From Beijing Academy of Artificial Intelligence
More than 100 languages supported
- window: 8192 tokens!
- parameters: 567M
- emb dim: 1024
- pooling: [CLS] token
- pre-trained model: XLM-RoBERTa large (24
transformer encoder layers)
29
29
Fine-tuning
Triplet loss: the query text, the positive document text and the
negative document text are each passed through the (shared)
pre-trained model followed by an ADAPTER, producing the query
embedding, the positive document embedding and the negative
document embedding.
Performed on an AWS g5.xlarge machine,
equipped with one NVIDIA A10G GPU
Supports training with the bfloat16 data type
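A minimal sketch of the adapter-plus-triplet-loss idea in PyTorch (illustrative assumptions throughout: the frozen LaBSE backbone, the learning rate, the margin/distance choice and the toy triplet are not taken from the talk, and the real training loop, batching and bfloat16 setup are not reproduced):

# Sketch: linear adapter on top of a frozen pre-trained encoder, trained with a triplet loss.
import torch
from torch import nn
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")  # frozen backbone (illustrative)
for p in encoder.parameters():
    p.requires_grad = False

adapter = nn.Linear(768, 768)  # linear adapter on top of the 768-dim LaBSE embeddings
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4, weight_decay=0.01)
triplet_loss = nn.TripletMarginWithDistanceLoss(
    distance_function=lambda a, b: 1 - nn.functional.cosine_similarity(a, b)
)

def embed(texts):
    # Encode with the frozen model, then project through the trainable adapter.
    with torch.no_grad():
        base = encoder.encode(texts, convert_to_tensor=True)
    return adapter(base)

# One training step on a (query, positive, negative) triplet batch.
queries = ["quale frutto è originario dell'australia"]
positives = ["Passiflora herbertiana, un frutto originario dell'Australia."]
negatives = ["La noce di cola è originaria dell'Africa."]

loss = triplet_loss(embed(queries), embed(positives), embed(negatives))
loss.backward()
optimizer.step()
optimizer.zero_grad()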
30
30
Fine-tuning hyperparameters
Model | Adapter | Learning Rate | Adapter inner dim
LaBSE | linear | | None
mult-e5-large | linear | | None
bge-m3 | linear, non-linear | | [512, 1024, 2048]
- Optimizer: AdamW (decay=0.01)
- Batch size: the maximum possible for each model
- 10 epochs of 5M samples
- 5 validations per epoch, each on 10k samples
Overview
1
2
3
4
5
Introduction and Motivation
Datasets
Model Structure and Fine-tuning
Evaluation
Results and Conclusions
32
32
33
33
Skip lists
34
34
Navigable small world graphs
Both short and long
range connections
Greedy search
35
35
HNSW: skip lists + navigable small world
The layer where a node's insertion starts is chosen by
sampling from a skewed probability distribution
Main parameters:
● M: number of nearest neighbours
linked when a node is inserted
● ef_construction: number of nearest
neighbours examined during
insertion
These parameters affect both the speed and the
accuracy of the search algorithm
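A minimal HNSW sketch using the hnswlib library (illustrative only: the talk's system uses Weaviate's HNSW implementation, and the dimensions, data and parameter values below are assumptions):

# HNSW index sketch with hnswlib.
import numpy as np
import hnswlib

dim = 1024  # e.g. the bge-m3 / multilingual-e5-large embedding size
doc_embs = np.random.rand(10_000, dim).astype(np.float32)  # stand-in for document embeddings

index = hnswlib.Index(space="cosine", dim=dim)
# M: neighbours linked per inserted node; ef_construction: candidates examined at insertion.
index.init_index(max_elements=doc_embs.shape[0], M=16, ef_construction=200)
index.add_items(doc_embs, np.arange(doc_embs.shape[0]))

# ef controls the search-time candidate list: higher = slower but more accurate.
index.set_ef(100)
query_emb = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query_emb, k=10)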
36
36
Metrics explored
● nDCG@k (normalized Discounted Cumulative Gain)
● MRR@k (Mean Reciprocal Rank)
● MAP@k (Mean Average Precision)
● Precision@k
● Recall@k
● F1@k
As a rule of thumb for all our evaluations, we used k=10.
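For reference, the main ranking metric in its standard graded-relevance form (general background, not taken from the slides):

DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
nDCG@k = \frac{DCG@k}{IDCG@k}

where rel_i is the graded relevance of the document at rank i (here 0, 1 or 2, per the qrels) and IDCG@k is the DCG@k of the ideal ranking.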
Implementation
Neural embedding models
37
37
Count based models
38
38
Evaluation summary
Vector Database
Implementation: Weaviate
Search method:
● HNSW
● exhaustive
Statistical Test
Wilcoxon signed-rank test
● Non-parametric
● Does NOT assume a
normal distribution of
the data
Main metric
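A minimal sketch of how such a paired significance test can be run on per-query metric scores with SciPy (the score values below are placeholders, not results from the talk):

# Paired, non-parametric test on per-query scores of two systems.
from scipy.stats import wilcoxon

scores_finetuned = [0.62, 0.48, 0.71, 0.55, 0.66, 0.59]   # per-query nDCG@10, system A
scores_pretrained = [0.58, 0.45, 0.70, 0.50, 0.60, 0.57]  # per-query nDCG@10, system B

statistic, p_value = wilcoxon(scores_finetuned, scores_pretrained)
print(statistic, p_value)  # reject the null hypothesis if p_value is below the chosen alpha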
Overview
1
2
3
4
5
Introduction and Motivation
Datasets
Model Structure and Fine-tuning
Evaluation
Results and Conclusions
39
39
40
40
Time complexity
Model | Embedding dimension | Average search time
LaBSE | 768 | 32 ms
multilingual-e5-large linear | 1024 | 47 ms
bge-m3 | 1024 | 52 ms
text-embedding-3-small | 1536 | 615 ms
41
41
42
42
Conclusions
Partial success of the
fine-tuning method
2 out of 3 models significantly
improved their performance
compared with their
pre-trained version
Inference time
lowered
Using the best model instead
of text-embedding-3-small
from OpenAI reduces the
average inference time by one
order of magnitude (this can
vary depending on the internet
connection)
Cost-effective solution
The initial training cost is
higher, but since the best
model can run locally, inference
incurs no API cost, in contrast
to the OpenAI model
43
43
Future work
● Boost the pipeline: reranking, query rewriting,
prefix tuning / prompt tuning
● Integration into RAG
● User-centered evaluation metrics
● Fine-tuning dataset: larger or domain-specific
corpora; find or create aligned English-Italian
datasets to avoid translation
THANK YOU
sease.io
44


Editor's Notes

  • #8: BENEFITS: transparency; reliability (avoids "hallucination"); up-to-date information. CHALLENGES: the retrieved documents must be relevant; the retrieval step adds computation; the retrieved context must be integrated into the generative process.
  • #9: OBJECTIVE: develop a cost-effective and high-performance semantic search system based on document embeddings. For those who don't know what embeddings are: they are high-dimensional vectors that represent the semantic content of a chunk/document.
  • #17: Here, talk about the chunking strategy and the setup used in the relational database (SQLite) to insert all the documents for testing and validation. Also remember to mention the subsampling part.