Speaker: Nicolò Rinaldi
INFORMATION RETRIEVAL MEETUP 2025 - London, United Kingdom
Exploring Multilingual Embeddings for
Italian Semantic Search
1
A Pretrained and Fine-tuned Approach
Who am I?
NICOLÒ RINALDI
SOFTWARE ENGINEER/DATA SCIENTIST @ Sease
● Born in Mirandola
● Master's Degree in Data Science at the University
of Padua
● Passionate about semantic search, NLP, machine
learning, information retrieval and related
technologies
● Tennis player and mountain lover
2
2
Overview
1
2
3
4
5
Introduction and Motivation
Datasets
Model Structure and Fine-tuning
Evaluation
Results and Conclusions
3
3
GOAL
5
Develop a system capable of answering customer questions in natural
language, for the Italian language
BENEFITS
6
● Ease of interaction
● Reduction of information overload
● Accessible to a broader range of people
7
7
HOW?
RAG: Retrieval-Augmented Generation
8
Search component
9
Dense vs lexical retrieval
Dense
Uses neural embeddings to represent queries
and documents
Computes semantic similarity in vector space
Handles synonyms and paraphrases better,
enabling semantic search.
Lexical
Based on exact word matching
Retrieves documents that share keywords with
the query
Simple and fast, but struggles with synonyms or
paraphrases.
10
Symmetric vs Asymmetric semantic search
Asymmetric
Short queries and longer documents, where we
hope to find information answering the input
query
Symmetric
Queries and the documents in the corpus are
about the same length
11
Model to beat: text-embedding-3-small (OpenAI)
SPECS
Embedding dimension: 1536
Cost: $0.02 per 1M tokens
CONS
Embedding dimension is large -> slow search
One API call is needed for each query at inference
time -> slow inference
The costs keep increasing
12
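For context, a minimal sketch of the per-query embedding API call this model requires at inference time (illustrative only; it assumes the openai Python client and an API key are available):

# Sketch of the per-query embedding API call the OpenAI model requires.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="quale frutto è originario dell'australia",
)
query_embedding = response.data[0].embedding  # 1536-dimensional vector
# Every query at inference time pays this network round trip and the token cost.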
Roadmap
13
13
Find Models
Find Datasets
Fine-tune the models
Overview
1
2
3
4
5
Introduction
Datasets
Model Structure and Fine-tuning
Evaluation
Results and Conclusions
14
14
Train dataset: mMARCO
15
15
14 different languages, 39M samples
query quale frutto è originario dell'australia
pos
Passiflora herbertiana. Un raro frutto della passione originario dell'Australia. I frutti sono a buccia verde, a polpa
bianca, con una valutazione commestibile sconosciuta. Alcune fonti elencano il frutto come commestibile, dolce
e gustoso, mentre altre elencano i frutti come amari e non commestibili.
neg
La noce di cola è il frutto dell'albero di cola, un genere (Cola) di alberi originari delle foreste pluviali tropicali
dell'Africa.
Validation and Test set: Dbpedia-Entity-v2
16
16
4.6M documents, 400 test queries
and 67 validation queries
qrels:
● 0 : non-relevant
● 1 : relevant
● 2 : doc is the answer to the
query
Validation and Test set: Dbpedia-Entity-v2
19
19
_id: <dbpedia:Afghan_cuisine>
title: Afghan cuisine
text: La cucina afghana si basa in gran parte sulle colture
principali della nazione, come grano, mais, orzo e riso.
Accompagnando queste materie prime sono frutta e verdura
autoctoni, così come prodotti lattiero-caseari come latte,
yogurt e siero di latte. Kabuli Palaw è il piatto nazionale
dell'Afghanistan. Le specialità culinarie della nazione
riflettono la sua diversità etnica e geografica. L'Afghanistan è
noto per i suoi melograni di alta qualità, uva e meloni a
forma di calcio dolce.
_id: INEX_LD-2009022
text: Szechwan piatto cucina alimentare
query-id: INEX_LD-2009022
corpus-id: <dbpedia:Afghan_cuisine>
score: 0
Overview
1
2
3
4
5
Introduction and Motivation
Datasets
Model Structure and Fine-tuning
Evaluation
Results and Conclusions
20
20
21
21
Models
Baselines (count based):
● TF-IDF
● BM25
Neural network based (from MTEB, the Massive
Text Embedding Benchmark):
● LaBSE
● multilingual-e5-large
● bge-m3
22
22
TF-IDF: Term Frequency-Inverse Document Frequency
Term: t
Document: d
No semantic representation, just
keyword matching
TF and IDF are stored alongside the
inverted index during index
construction. This makes retrieval
very fast.
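For reference, a standard formulation of the TF-IDF weight (variants exist; this is general background rather than the exact formula shown on the slide):

\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}

where tf(t, d) is the frequency of term t in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t.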
23
23
BM25: Best Match 25
TF(qi, d) is the frequency of term qi in the document d.
|d| is the length of the document d.
|davg| is the average document length over all the documents in the
corpus.
IDF(qi) denotes the Inverse Document Frequency for term qi.
Params: k1 is typically set between 1.2 and 2.0, and b is often set around 0.75
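For reference, the standard BM25 scoring function these symbols belong to:

\mathrm{score}(d, q) = \sum_{q_i \in q} \mathrm{IDF}(q_i) \cdot
\frac{\mathrm{TF}(q_i, d)\,(k_1 + 1)}
{\mathrm{TF}(q_i, d) + k_1\left(1 - b + b\,\frac{|d|}{|d_{avg}|}\right)}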
24
24
BERT: Bidirectional Encoder Representations from Transformers
25
25
Sentence-BERT: Sentence embeddings using siamese BERT-networks
Various options for the similarity function:
- Cosine similarity
- Dot product
Various pooling methods:
- [CLS] token embedding
- Mean of all the input token embeddings
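To make the bi-encoder setup concrete, here is a minimal sketch with the sentence-transformers library (LaBSE is used only because it appears later in the deck; pooling and similarity follow the model's own configuration, and this is not the talk's actual pipeline):

# Minimal bi-encoder sketch: encode query and documents, compare with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

query = "quale frutto è originario dell'australia"
docs = [
    "Passiflora herbertiana. Un raro frutto della passione originario dell'Australia.",
    "La noce di cola è il frutto dell'albero di cola, originario dell'Africa.",
]

# Encode query and documents into dense vectors (pooling is defined by the model).
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)

# Cosine similarity in vector space: higher means semantically closer.
scores = util.cos_sim(query_emb, doc_embs)
print(scores)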
26
26
LaBSE: Language-Agnostic BERT Sentence Embedding
From Google AI
109 languages supported
- window: 256 tokens
- parameters: 471M
- emb dim: 768
- pooling: [CLS] token
- pre-trained model: BERT base (12
transformer encoder layers)
27
27
Multilingual E5 Text Embeddings: large version
From Microsoft Corporation
Tested on over 100 languages
- window: 512 tokens, but needs prefix “query: “
- parameters: 559M
- emb dim: 1024
- pooling: Mean of all the input tokens
- pre-trained model: XLM-RoBERTa large (24
transformer encoder layers)
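The "query: " prefix matters in practice. A minimal sketch of how inputs are typically prefixed for multilingual-e5-large (illustrative; the "passage: " prefix for documents follows the model card rather than this deck):

# E5-style prefixing sketch with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

query_emb = model.encode("query: quale frutto è originario dell'australia")
doc_emb = model.encode("passage: Passiflora herbertiana è un frutto originario dell'Australia.")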
28
28
BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity
Text Embeddings Through Self-Knowledge Distillation
From Beijing Academy of Artificial Intelligence
More than 100 languages supported
- window: 8192 tokens!
- parameters: 567M
- emb dim: 1024
- pooling: [CLS] token
- pre-trained model: XLM-RoBERTa large (24
transformer encoder layers)
29
29
Fine-tuning
Triplet loss: the query text, the positive document text and the
negative document text are each passed through the (shared)
pre-trained model followed by an ADAPTER, producing the query
embedding, the positive document embedding and the negative
document embedding.
Performed on an AWS g5.xlarge machine,
equipped with one NVIDIA A10G GPU
Supports training with the bfloat16 data type
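A minimal sketch of the adapter-plus-triplet-loss idea in PyTorch (illustrative assumptions throughout: the frozen LaBSE backbone, the learning rate, the margin/distance choice and the toy triplet are not taken from the talk, and the real training loop, batching and bfloat16 setup are not reproduced):

# Sketch: linear adapter on top of a frozen pre-trained encoder, trained with a triplet loss.
import torch
from torch import nn
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")  # frozen backbone (illustrative)
for p in encoder.parameters():
    p.requires_grad = False

adapter = nn.Linear(768, 768)  # linear adapter on top of the 768-dim LaBSE embeddings
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4, weight_decay=0.01)
triplet_loss = nn.TripletMarginWithDistanceLoss(
    distance_function=lambda a, b: 1 - nn.functional.cosine_similarity(a, b)
)

def embed(texts):
    # Encode with the frozen model, then project through the trainable adapter.
    with torch.no_grad():
        base = encoder.encode(texts, convert_to_tensor=True)
    return adapter(base)

# One training step on a (query, positive, negative) triplet batch.
queries = ["quale frutto è originario dell'australia"]
positives = ["Passiflora herbertiana, un frutto originario dell'Australia."]
negatives = ["La noce di cola è originaria dell'Africa."]

loss = triplet_loss(embed(queries), embed(positives), embed(negatives))
loss.backward()
optimizer.step()
optimizer.zero_grad()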
30
30
Fine-tuning hyperparameters
Model | Adapter | Learning Rate | Adapter inner dim
LaBSE | linear | | None
mult-e5-large | linear | | None
bge-m3 | linear, non-linear | | [512, 1024, 2048]
- Optimizer: AdamW (decay=0.01)
- Batch size: the maximum possible for each model
- 10 epochs of 5M samples
- 5 validations per epoch, each on 10k samples
Overview
1
2
3
4
5
Introduction and Motivation
Datasets
Model Structure and Fine-tuning
Evaluation
Results and Conclusions
32
32
33
33
Skip lists
34
34
Navigable small world graphs
Both short and long
range connections
Greedy search
35
35
HNSW: skip lists + navigable small world
The layer where a node's insertion starts is chosen by
sampling from a skewed probability distribution
Main parameters:
● M: number of nearest neighbours
linked when a node is inserted
● ef_construction: number of nearest
neighbours examined during
insertion
These parameters affect both the speed and the
accuracy of the search algorithm
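A minimal HNSW sketch using the hnswlib library (illustrative only: the talk's system uses Weaviate's HNSW implementation, and the dimensions, data and parameter values below are assumptions):

# HNSW index sketch with hnswlib.
import numpy as np
import hnswlib

dim = 1024  # e.g. the bge-m3 / multilingual-e5-large embedding size
doc_embs = np.random.rand(10_000, dim).astype(np.float32)  # stand-in for document embeddings

index = hnswlib.Index(space="cosine", dim=dim)
# M: neighbours linked per inserted node; ef_construction: candidates examined at insertion.
index.init_index(max_elements=doc_embs.shape[0], M=16, ef_construction=200)
index.add_items(doc_embs, np.arange(doc_embs.shape[0]))

# ef controls the search-time candidate list: higher = slower but more accurate.
index.set_ef(100)
query_emb = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query_emb, k=10)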
36
36
Metrics explored
● nDCG@k (normalized Discounted Cumulative Gain)
● MRR@k (Mean Reciprocal Rank)
● MAP@k (Mean Average Precision)
● Precision@k
● Recall@k
● F1@k
As a rule of thumb for all our evaluations, we used k=10.
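For reference, the main ranking metric in its standard graded-relevance form (general background, not taken from the slides):

DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
nDCG@k = \frac{DCG@k}{IDCG@k}

where rel_i is the graded relevance of the document at rank i (here 0, 1 or 2, per the qrels) and IDCG@k is the DCG@k of the ideal ranking.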
Implementation
Neural embedding models
37
37
Count based models
38
38
Evaluation summary
Vector Database
Implementation: Weaviate
Search method:
● HNSW
● exhaustive
Statistical Test
Wilcoxon signed-rank test
● Non-parametric
● Does NOT assume a
normal distribution of
the data
Main metric
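A minimal sketch of how such a paired significance test can be run on per-query metric scores with SciPy (the score values below are placeholders, not results from the talk):

# Paired, non-parametric test on per-query scores of two systems.
from scipy.stats import wilcoxon

scores_finetuned = [0.62, 0.48, 0.71, 0.55, 0.66, 0.59]   # per-query nDCG@10, system A
scores_pretrained = [0.58, 0.45, 0.70, 0.50, 0.60, 0.57]  # per-query nDCG@10, system B

statistic, p_value = wilcoxon(scores_finetuned, scores_pretrained)
print(statistic, p_value)  # reject the null hypothesis if p_value is below the chosen alpha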
Overview
1
2
3
4
5
Introduction and Motivation
Datasets
Model Structure and Fine-tuning
Evaluation
Results and Conclusions
39
39
40
40
Time complexity
Model | Embedding dimension | Average search time
LaBSE | 768 | 32 ms
multilingual-e5-large linear | 1024 | 47 ms
bge-m3 | 1024 | 52 ms
text-embedding-3-small | 1536 | 615 ms
41
41
42
42
Conclusions
Partial success of the
fine-tuning method
2 out of 3 models significantly
improved their performance
compared with their
pre-trained version
Inference time
lowered
Using the best model instead
of text-embedding-3-small
from OpenAI reduces the
average inference time by one
order of magnitude (this can
vary depending on the internet
connection)
Cost-effective solution
The initial training cost is
higher, but since the best
model can run locally, inference
incurs no API cost, in contrast
to the OpenAI model
43
43
Future work
● Boost the pipeline: reranking, query rewriting,
prefix tuning / prompt tuning
● Integration into RAG
● User-centered evaluation metrics
● Fine-tuning dataset: larger or domain-specific
corpora; find or create aligned English-Italian
datasets to avoid translation
THANK YOU
sease.io
44


Editor's Notes

  • #8: BENEFITS: transparency; reliability (avoids "hallucination"); up-to-date information. CHALLENGES: the retrieved documents must be relevant; the retrieval step adds computation; the retrieved context must be integrated into the generative process.
  • #9: OBJECTIVE: develop a cost-effective and high-performance semantic search system based on document embeddings. For those who don't know what embeddings are: they are high-dimensional vectors that represent the semantic content of a chunk/document.
  • #17: Here, talk about the chunking strategy and the setup used in the relational database (SQLite) to insert all the documents for testing and validation. Also remember to mention the subsampling part.