When Short Queries Meet Long Documents:

When you have a database full of large documents and want to retrieve relevant results using a short query of 3-4 words, you have three options: lexical search, fuzzy search, and semantic search.

Lexical search (string matching / phrase match) fails to retrieve documents where a synonym is used, or may incorrectly retrieve documents that are not semantically relevant. For example, a search for "Spring" may return information on the season rather than the software framework. Similarly, fuzzy search has its own disadvantages.

How about semantic search? Use a BERT-like model, pre-compute the embeddings for the documents, store them in a vector DB, and at retrieval time generate the query embedding, match it against each document's embedding, and return the documents with the highest cosine-similarity/dot-product score. The majority of IR (Information Retrieval) systems nowadays use this approach, and it works well when both the query and the documents are relatively long. But for shorter queries and longer documents it has two issues:

  1. The method of embedding generation: we pass the query/document through an encoder (BERT-like) and mean-pool the last layer.
  2. A shorter query can lead to a sparse vector, while longer documents generate a dense vector representation in most cases.

The first issue can lead to a loss of key information: averaging the word vectors may drown out crucial information dimensions. This is the problem we faced and tackled at Auquan using the "late interaction" approach presented in the ColBERT paper.
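
To make that baseline concrete, here's a minimal sketch of the standard single-vector approach: mean-pool a BERT-like encoder's last layer and compare with cosine similarity. The model name and example texts below are placeholders for illustration, not our production setup:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # any BERT-like encoder
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    # Mean-pool the last hidden layer into a single vector, ignoring padding tokens.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        last_hidden = encoder(**inputs).last_hidden_state        # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)     # (1, dim)

query_vec = embed("spring dependency injection")
doc_vec = embed("A long document about the Spring framework ...")
score = torch.nn.functional.cosine_similarity(query_vec, doc_vec)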

Late Interaction to the Rescue

ColBERT's late interaction model allows fine-grained, token-level interaction between the query and the document. This approach retains contextual information for each word instead of averaging everything into a single vector. Here's how it works (a short code sketch follows the list):

  1. Pre-generate word (token) embeddings for the document and store them in a vector database. Instead of mean-pooling, we keep the last layer's per-token vectors: if a document has 400 words, it generates 400 vectors.
  2. For the (shorter) query, instead of computing a single embedding, generate an embedding for each word as well.
  3. Calculate the similarity between each token in the query and every token in the document using the dot product.
  4. Max-similarity: for each query token, take the highest similarity with any document token.
  5. Aggregation: the final score is the average of all max-similarity scores, representing the relevance of the document to the query.
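
As a compact sketch, steps 3-5 boil down to a few lines of torch (the token embeddings are assumed to be computed already; the names are illustrative):

import torch

def late_interaction_score(query_embeddings: torch.Tensor,
                           doc_embeddings: torch.Tensor) -> float:
    # query_embeddings: (num_query_tokens, dim); doc_embeddings: (num_doc_tokens, dim)
    scores = query_embeddings @ doc_embeddings.T    # step 3: all token-pair dot products
    max_scores = scores.max(dim=1).values           # step 4: best document token per query token
    return max_scores.mean().item()                 # step 5: average into one relevance score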


For example: let's say we have a query with 3 words and a document with 5 words, and the vector dimension is 4 for illustration.


[Image: Query and document word embeddings]
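
The actual numbers live in the image above; as stand-ins with the right shapes (3 query tokens × 4 dimensions, 5 document tokens × 4 dimensions), hypothetical tensors could look like this:

import torch

torch.manual_seed(0)                  # reproducible placeholder values, not the figure's numbers
query_embeddings = torch.rand(3, 4)   # 3 query tokens, dim 4
doc_embeddings = torch.rand(5, 4)     # 5 document tokens, dim 4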

Dot Product Calculation:

For each word in the query, compute the dot product with each word in the document. The dot product is calculated as:

  • Dot product: Multiply corresponding elements from the two vectors and sum them.

[Image: Dot-product calculation for each query word and document word]
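
In code, the whole matrix of dot products comes from a single matrix multiplication; a sketch using the placeholder tensors defined above:

# query_embeddings is (3, 4) and doc_embeddings is (5, 4), as defined earlier
scores = query_embeddings @ doc_embeddings.T                      # (3, 5); scores[i, j] = query token i · doc token j
score_q1_d1 = torch.dot(query_embeddings[0], doc_embeddings[0])   # a single pair, equal to scores[0, 0]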

Dot product scores:

After calculating the dot products for Query Token 1, the process is repeated for the other query tokens.

The resulting scores matrix will look like this:

scores: A matrix where each row corresponds to a query token, and each column represents the similarity with a document token. The values are the dot products (similarities) between the embeddings of the corresponding tokens.


[Image: Dot-product scores of query words and document words]
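
Written out as a tensor, using the values from the walkthrough below, the scores matrix is:

scores = torch.tensor([[11.8,  5.0, 13.6, 21.8, 12.6],    # query token 1 vs doc tokens 1-5
                       [ 9.8,  4.0, 10.6, 16.1, 10.1],    # query token 2
                       [ 8.6,  5.0, 12.2, 20.9, 11.7]])   # query token 3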


Next Step: Max-Sim

On the scores matrix, we perform max-pooling along the document token dimension to retain only the highest similarity for each query token. This is done with:

max_scores = torch.max(scores, dim=1).values        

Let’s break it down:

Max-pooling along document tokens: We take the maximum value from each row (which represents the similarity scores for one query token with all document tokens).

  • For Query Token 1: The maximum value in [11.8, 5.0, 13.6, 21.8, 12.6] is 21.8.
  • For Query Token 2: The maximum value in [9.8, 4.0, 10.6, 16.1, 10.1] is 16.1.
  • For Query Token 3: The maximum value in [8.6, 5.0, 12.2, 20.9, 11.7] is 20.9.

Result after Max-Pooling: After taking the max value for each query token, the resulting tensor is:

max_scores = torch.tensor([21.8, 16.1, 20.9])        

This gives us the most relevant similarity score for each query token when matched against all the document tokens.

Final Step: Aggregating Scores

The next part of the code averages the maximum scores to produce a single relevance score for the entire query-document pair:

final_score = max_scores.mean()  # average of the max scores
# In our case: final_score = (21.8 + 16.1 + 20.9) / 3 = 19.6

Interpretation:

  • The final score 19.6 represents the overall relevance of the document to the query based on token-level interactions.
  • By taking the maximum similarity for each query token and averaging, the model focuses on the best match for each part of the query and aggregates them into one score.

Similarly, we compute the score for every document in the collection and rank the documents by it.
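
A minimal sketch of that ranking step, assuming the per-document token embeddings were pre-computed and kept in a doc_id-to-tensor mapping (the names are illustrative, not our production code):

import torch

def late_interaction_score(query_embeddings, doc_embeddings):
    scores = query_embeddings @ doc_embeddings.T       # token-level dot products
    return scores.max(dim=1).values.mean().item()      # MaxSim per query token, then average

def rank_documents(query_embeddings, doc_token_embeddings):
    # doc_token_embeddings: dict of doc_id -> (num_doc_tokens, dim) tensor
    scored = [(doc_id, late_interaction_score(query_embeddings, emb))
              for doc_id, emb in doc_token_embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Illustrative usage with random placeholder embeddings:
query = torch.rand(3, 128)
docs = {"doc_a": torch.rand(400, 128), "doc_b": torch.rand(250, 128)}
print(rank_documents(query, docs))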

I'd like to thank Leonie Monigatti and Jina AI for their excellent articles and blog posts on late interaction.

Feel free to leave your comments here; happy to discuss anything related to tech.

#LateInteraction #InformationRetrieval #ColBERT #SemanticSearch #AIML #NLP
