Hybrid Search with Apache Solr Reciprocal Rank Fusion

Hybrid Search with Apache
Solr Reciprocal Rank Fusion
Speaker: Alessandro Benedetti, Director @ Sease
BERLIN BUZZWORDS 2024 - 10/06/2024

‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ R&D Software Engineer
‣ Director
‣ Master degree in Computer Science
‣ PC member for ECIR, SIGIR and Desires
‣ Apache Lucene/Solr PMC member/committer
‣ Elasticsearch/OpenSearch expert
‣ Semantic search, NLP, Machine Learning
technologies passionate
‣ Beach Volleyball player and Snowboarder
ALESSANDRO BENEDETTI
WHO AM I ?
2

‣ Headquarter in London/distributed
‣ Open-source Enthusiasts
‣ Apache Lucene/Solr experts
‣ Elasticsearch/OpenSearch experts
‣ Community Contributors
‣ Active Researchers
Hot Trends :
● Large Language Models Applications
● Vector-based (Neural) Search
● Natural Language Processing
● Learning To Rank
● Document Similarity
● Search Quality Evaluation
● Relevance Tuning
SEArch SErvices
www.sease.io
3

Limitations of Vector-Based Search
Reciprocal Rank Fusion
APIs
Internals
Explainability
Overview

Vector-based (Neural) Search Workflow
Similarity between a Query and a Document is translated to distance in a vector space

Low Explainability
● High Dimensionality - vectors are long sequences of numerical values (768,
1536 …)
● Dimensions - each feature (element in the vector) has no clear semantic in
many cases (slightly different from sparse vectors or explicit feature vectors)
● Values - It’s not obvious to estimate how a single value impact relevance
(higher is better?)
● Similarity - To explain why a search result is retrieved in the top-K the vector
distance is the only info you have
Research is happening but it’s still an open problem

Lexical matches?
● Search users still have the expectation of lexical matches to happen
● You can’t enforce that with Vector-based search
“Why the document with the keyword in the title
is not coming up?” cit.

Low Diversity
● vector-based search just returns the top-k ordered by
vector similarity
● Unless you add more logic on top, you would expect low
diversity by definition

Hybrid Search
● Mitigation of current vector-search problems - Is it here to stay?
● Combine traditional keyword-based (lexical) search with vector-based (neural)
search
● Retrieval of two sets of candidates:
○ one set of results coming from lexical matches with the query keywords
○ a set of results coming from the K-Nearest Neighbours search with the query
vector
● Ranking of the candidates

● Multiple ranked lists -> unified result set.
● reciprocal rank -> inverse of the rank of a document in a ranked list of search
results.
● higher ranking -> higher contribution
● RRF was introduced the first time by Cormack et al. in [1].
[1] Cormack, Gordon V. et al. “Reciprocal rank fusion outperforms condorcet and individual rank learning methods.” Proceedings of the 32nd
international ACM SIGIR conference on Research and development in information retrieval (2009)

score = 0.0
for q in queries:
if d in result(q):
score += 1.0 / ( k + rank( result(q), d ) )
return score

● e.g.
● K=10
1 id10
2 id7
3 id9
4 id5
5 id3
1 id7
2 id5
3 id9
4 id4
5 id10
id7 ( 0.174 = 1/10+2 +1/10+1)
id10 ( 0.157 = 1/10+1 +1/10+5)
id5 ( 0.155 = 1/10+4 +1/10+2)
id9 ( 0.154 = 1/10+3 +1/10+3)
id4 ( 0.071 = 1/10+4)
id3 ( 0.666 = 1/10+5)

Open Pull Request
https://github.com/apache/solr/pull/2489
Challenges
N.B. If you are not a full-time contributor, the Lucene/Solr split hits
hard!

JSON APIs
{
"queries": {
"lexical1": {
"lucene": {
"query": "id:(10^=2 OR 2^=1 OR 4^=0.5)"
}
},
"lexical2": {
"lucene": {
"query": "id:(2^=2 OR 4^=1 OR 3^=0.5)"
}
}
},
"limit": 10,
"fields": "[id,score]",
"params": {
"combiner": true,
"combiner.upTo": 5
}
}
Trivia:
“queries” support was originally
introduced to be used in faceting for
domain/exclusion
https://issues.apache.org/jira/brows
e/SOLR-12490

JSON APIs
org.apache.solr.common.params.CombinerParams
String COMBINER = "combiner";
String COMBINER_ALGORITHM = COMBINER + ".algorithm";
String RECIPROCAl_RANK_FUSION = "rrf";
String COMBINER_UP_TO = COMBINER + ".upTo";
int COMBINER_UP_TO_DEFAULT = 100;
String COMBINER_RRF_K = COMBINER + "." + RECIPROCAl_RANK_FUSION + ".k";
int COMBINER_RRF_K_DEFAULT = 60; // from original paper

JSON Request
solr/core/src/java/org/apache/solr/request/json/RequestUtil.java
● Just store the query keys as a Solr param (so that we can access later)
{
"queries": {
"lexical1": {
…
},
"lexical2": {
…
}
},
…
}
[“lexical1”,”lexical2”]

Query Component Prepare
It has the scope to prepare the variables and structures to use at processing time
● get the queries to combine from params
○ save them (unparsed query string) in the response builder
● parse the queries
○ save them parsed in the response builder
○ save the query parsers (class name) in the response builder
org.apache.solr.handler.component.QueryComponent#prepare

Query Component Process
It has the scope to do the combination of search results (not breaking single query
scenarios!)
● get the parsed queries to combine from response builder
○ retrieve the result sets (one for each query)
● combine the results
○ save the combined set in the response builder
org.apache.solr.handler.component.QueryComponent#process

Queries Combiner
org.apache.solr.search.combining.QueriesCombiner
It has the scope to combine results sets, abstracting the algorithm
● init the combined query results
○ keeping track of additional attributes such as Partial Results
● offer abstract methods for
○ combination
○ explainability
public abstract QueryResult combine(QueryResult[] rankedLists);
public abstract NamedList<Explanation> getExplanations(
String[] queryKeys,
List<Query> queries,
List<DocList> resultsPerQuery,
SolrIndexSearcher searcher,
IndexSchema schema)

org.apache.solr.search.combining.ReciprocalRankFusion
It has the scope to implement the specific algorithm
● combine the search results (>=2)
○ keeping track of additional attributes such as Partial Results
● explain them
○ how the combination happened, based on the rankings
○ how the original score was calculated

Limitations
This is a first step, the current limitations include:
● Field collapsing/grouping is not supported
● Distributed search support is included but combination happens per node and
may not be ideal.

Debug Component
We have two layers of explainability:
● Query Parsing
● Results explanation
○ Combiner - how the multiple results sets were combined.
○ Original Scores - how the original scores were obtained

Query Debug
<lst name="debug">
<lst name="queriesToCombine">
<lst name="lexical">
<str name="querystring">id:(10^=2 OR 2^=1)</str>
<str name="queryparser">LuceneQParser</str>
<str name="parsedquery">ConstantScore(id:10)^2.0
ConstantScore(id:2)^1.0</str>
<str
name="parsedquery_toString">(ConstantScore(id:10))^2.0 (ConstantScore(id:2))^1.0</str>
</lst>
<lst name="vector-based">
<str name="querystring">[1.0, 2.0, 3.0, 4.0]</str>
<str name="queryparser">KnnQParser</str>
<str
name="parsedquery">KnnFloatVectorQuery(KnnFloatVectorQuery:vector[1.0,...][5])</str>
<str
name="parsedquery_toString">KnnFloatVectorQuery:vector[1.0,...][5]</str>
</lst>
</lst>

Results Debug
'0.032522473 =
1/(60+1) + 1/(60+2) because its ranks were:
1 for query(lexical), 2 for query(lexical2)'
+
Original Scores

Future Works
PR: https://github.com/apache/solr/pull/2489
- Distributed tests
- Documentation
—------------------------
- better distributed support
- field collapsing/grouping
- Additional algorithms?

THANK YOU!
@seaseltd @sease-
ltd
@seaseltd @sease_ltd

Hybrid Search with Apache Solr Reciprocal Rank Fusion

More Related Content

Similar to Hybrid Search with Apache Solr Reciprocal Rank Fusion (20)

More from Sease (20)

Recently uploaded (20)

Hybrid Search with Apache Solr Reciprocal Rank Fusion