When Short Queries Meet Long Documents:
When you have a database full of large documents and want to retrieve relevant results using a short query of 3-4 words, you have three options: lexical search, fuzzy search, and semantic search.
Lexical search (string matching / phrase match) fails to retrieve documents where only a synonym is present, or it incorrectly retrieves documents that are not semantically relevant. For example, a search for "Spring" may return information about the season rather than the software framework. Fuzzy search has similar disadvantages.
How about semantic search? Use a BERT-like model to pre-compute embeddings for the documents and store them in a vector database. At retrieval time, generate the query embedding, match it against each document's embedding, and return the documents with the highest cosine-similarity or dot-product score. The majority of IR (Information Retrieval) systems today use this approach, and it works well when both the query and the documents are relatively long. For short queries paired with long documents, however, it has two issues:
The first issue is loss of key information: averaging word vectors into a single document embedding can drown out the dimensions that carry the crucial signal. This is the problem we faced and tackled at Auquan using the "late interaction" approach presented in the ColBERT paper.
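To make the averaging problem concrete, here is a minimal sketch of the single-vector approach with made-up toy embeddings (the dimensions and values are purely hypothetical):

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# A long document: 200 mostly off-topic token embeddings plus one crucial token.
doc_token_embs = torch.randn(200, 4)
key_token = torch.tensor([5.0, 0.0, 0.0, 0.0])
doc_token_embs[42] = key_token

# A short query whose embedding points in the same direction as the key token.
query_emb = torch.tensor([1.0, 0.0, 0.0, 0.0])

# Standard single-vector retrieval: average all token embeddings into one vector.
doc_emb = doc_token_embs.mean(dim=0)

print(F.cosine_similarity(query_emb, key_token, dim=0))  # 1.0: the key token matches perfectly
print(F.cosine_similarity(query_emb, doc_emb, dim=0))    # much lower: averaging drowns the key token out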
Late Interaction to the Rescue
ColBERT's late interaction model allows fine-grained, token-level interaction between query and document words (tokens). This approach retains contextual information for each token, rather than averaging everything into a single vector. Here's how it works:
For example, let's say we have a query with 3 words and a document with 5 words, with a vector dimension of 4 for the illustration.
Dot Product Calculation:
For each word in the query, compute the dot product with each word in the document. The dot product between query token embedding q_i and document token embedding d_j is score(i, j) = q_i · d_j, i.e. the sum of the element-wise products of the two 4-dimensional vectors.
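In PyTorch, assuming small hypothetical embedding matrices for the 3 query tokens and 5 document tokens, all of these dot products can be computed with a single matrix multiplication:

import torch

torch.manual_seed(0)

# Hypothetical token embeddings for the illustration:
# 3 query tokens and 5 document tokens, each with 4 dimensions.
query_embs = torch.randn(3, 4)
doc_embs = torch.randn(5, 4)

# scores[i, j] = dot product of query token i with document token j -> shape (3, 5)
scores = query_embs @ doc_embs.T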
After calculating the dot products for Query Token 1, the process is repeated for the other query tokens. The result is a scores matrix in which each row corresponds to a query token and each column corresponds to a document token; each value is the dot product (similarity) between the embeddings of the corresponding tokens.
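For instance, a hypothetical scores matrix could look like this (the values are made up for illustration; the row maxima are the ones carried into the MaxSim step below):

import torch

# Rows = query tokens, columns = document tokens.
scores = torch.tensor([
    [12.4, 21.8,  9.7, 15.2,  7.3],   # query token 1
    [16.1,  8.5, 11.0, 14.7,  6.9],   # query token 2
    [10.2, 13.6, 20.9,  5.4, 18.1],   # query token 3
])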
Next Step: MaxSim
On the scores matrix, we perform max-pooling along the document token dimension to retain only the highest similarity for each query token. This is done with:
max_scores = torch.max(scores, dim=1).values
Let’s break it down:
Max-pooling along document tokens: We take the maximum value from each row (which represents the similarity scores for one query token with all document tokens).
Result after Max-Pooling: After taking the max value for each query token, the resulting tensor is:
max_scores = torch.tensor([21.8, 16.1, 20.9])
This gives us the most relevant similarity score for each query token when matched against all the document tokens.
Final Step: Aggregating Scores
The next part of the code averages the maximum scores to produce a single relevance score for the entire query-document pair:
final_score = max_scores.mean()  # average of the max scores
# In our case: final_score = (21.8 + 16.1 + 20.9) / 3 = 19.6
Interpretation: this final score summarizes how well the document's tokens cover each query token. We calculate the same score for every candidate document and rank the documents by it.
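Putting it all together, a minimal end-to-end sketch of this ranking step could look like the following (random tensors stand in for real ColBERT token embeddings, and the mean aggregation from the worked example above is used; the ColBERT paper itself sums the per-token maxima):

import torch

def late_interaction_score(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> float:
    # query_embs: (num_query_tokens, dim); doc_embs: (num_doc_tokens, dim)
    scores = query_embs @ doc_embs.T              # (num_query_tokens, num_doc_tokens)
    max_scores = torch.max(scores, dim=1).values  # best document token per query token
    return max_scores.mean().item()               # aggregate into one relevance score

torch.manual_seed(0)

# Hypothetical data: one short query and a few documents of varying length.
query_embs = torch.randn(3, 4)
doc_emb_list = [torch.randn(n, 4) for n in (5, 12, 8)]

# Score every document against the query and rank, highest first.
ranked = sorted(
    ((late_interaction_score(query_embs, d), i) for i, d in enumerate(doc_emb_list)),
    reverse=True,
)
for score, doc_id in ranked:
    print(f"doc {doc_id}: {score:.2f}")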
I'd like to thank Leonie Monigatti and Jina AI for their excellent articles and blog posts on late interaction.
Feel free to leave your comments here; I'm happy to discuss anything related to tech.
#LateInteraction #InformationRetrieval #ColBERT #SemanticSearch #AIML #NLP