Interactive Q&A
10th December 2020
Question 1
General Considerations
Language Models
https://en.wikipedia.org/wiki/BERT_(language_model)
https://en.wikipedia.org/wiki/GPT-3
https://rajpurkar.github.io/SQuAD-explorer/
• Pre-trained on large corpora (expensive)
• Ad hoc fine-tuning to solve Natural Language tasks (inexpensive)
• Ability to encode terms and sentences as high-dimensional vectors
e.g.
https://github.com/google-research/bert#pre-trained-models
https://github.com/hanxiao/bert-as-service/
BERT vectors for the sentences ['access the bank', 'walking by the street', 'tigers are big cats']:
[[ 0.13186474  0.32404128 -0.82704437 ... -0.3711958  -0.39250174 -0.31721866]
 [ 0.24873531 -0.12334424 -0.38933852 ... -0.44756213 -0.5591355  -0.11345179]
 [ 0.28627345 -0.18580122 -0.30906814 ... -0.2959366  -0.39310536  0.07640187]]
General Considerations
Language Models in Search
• Indexing time: encode sentences (or full field contents) and store the vectors
• Search time: encode the query
• Score each query-document vector pair by calculating a vector distance/similarity:
Euclidean distance
Cosine similarity
…
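As a rough illustration of the scoring step above, the following Python sketch ranks a tiny in-memory corpus by brute force. The three-dimensional vectors and document ids are made up for illustration; real sentence embeddings have hundreds of dimensions.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors (lower = closer)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (higher = closer)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_cosine(query, corpus):
    """Score every document against the query; cost grows linearly with corpus size."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy vectors standing in for stored document embeddings.
corpus = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]
ranking = rank_by_cosine(query, corpus)
```

Because every stored vector is visited for every query, this is exactly the full-corpus scan that an approximate nearest neighbour index tries to avoid.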
Limitations
• Rank the entire corpus of documents? Apply an (Approximate) Nearest Neighbour approach?
• Performance of embedding extraction?
• Unintuitive results -> should be combined with traditional Information Retrieval
• Explainability
Apache Lucene
Ideally you want to avoid scoring all documents of your corpus for your query.
The algorithms for vector retrieval can be roughly classified into four categories:
1. Tree-based algorithms, such as KD-tree;
2. Hashing methods, such as LSH (Locality-Sensitive Hashing);
3. Product-quantization-based algorithms, such as IVFFlat;
4. Graph-based algorithms, such as HNSW, SSG, NSG.
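To make the hashing category more concrete, here is a minimal Python sketch of one LSH variant (random hyperplanes, suited to cosine similarity). It is only an illustration under assumed names and parameters (`make_hyperplanes`, `lsh_signature`, `build_buckets`, the seed, the number of planes); it is not the scheme used by Lucene or by any plugin mentioned in these slides.

```python
import random

def make_hyperplanes(num_planes, dims, seed=5):
    """Draw random hyperplanes through the origin (one normal vector each)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(num_planes)]

def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)

def build_buckets(vectors, hyperplanes):
    """Group document ids by signature so a query only inspects its own bucket."""
    buckets = {}
    for doc_id, vec in vectors.items():
        buckets.setdefault(lsh_signature(vec, hyperplanes), []).append(doc_id)
    return buckets

planes = make_hyperplanes(num_planes=8, dims=6)
vectors = {
    "1": [1.55, 3.53, 2.3, 0.7, 3.44, 2.33],
    "2": [3.54, 0.4, 4.16, 4.88, 4.28, 4.25],
}
buckets = build_buckets(vectors, planes)
```

Vectors pointing in similar directions tend to share a signature, so candidate retrieval becomes a bucket lookup followed by exact scoring of the few candidates found.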
Specific File Format (Nov 2020)
• https://issues.apache.org/jira/browse/LUCENE-9004
Hierarchical Navigable Small World Graphs - DONE
• https://issues.apache.org/jira/browse/LUCENE-9322
Unified Vector Format - DONE
• https://issues.apache.org/jira/browse/LUCENE-9136
IVFFlat - In Progress
Apache Lucene
Follow-ups
- reducing heap usage during graph construction
- adding a Query implementation
- exposing index hyper-parameters
- benchmarks
- testing on public datasets
- implementing a diversity heuristic for neighbour selection during graph construction
- making the graph hierarchical
- exploring more efficient search across multiple per-segment graphs…
Keep an eye on Lucene JIRA!
https://issues.apache.org/jira/browse/LUCENE-9004
Apache Solr
Status of Deep Learning Vector Based Search
• Lucene latest codecs and file format not used yet
https://issues.apache.org/jira/browse/SOLR-14397 -> develop an official out-of-the-box solution
https://issues.apache.org/jira/browse/SOLR-12890 -> summary
Ready-to-use Approaches
• Vector Scoring using Streaming Expressions (Point Fields)
• Available Solr Vector Search Plugin - https://github.com/saaay71/solr-vector-scoring (Payloads)
https://medium.com/@dmitry.kan/fun-with-apache-lucene-and-bert-embeddings-c2c496baa559
• Available Solr Vector Search Plugin with LSH Hashing (Payloads)
Limitations
• Generally slow solutions
• Re-use existing data structures instead of ad hoc codecs/file formats
• Generally support only one vector per field
Apache Solr - Streaming Expressions
Index Time
<dynamicField name="*_fs" type="pfloats" indexed="true" stored="true"/>
Sample Docs:

curl -X POST -H "Content-Type: application/json" "http://localhost:8983/solr/food_collection/update?commit=true" --data-binary '
[
{"id": "1", "name_s":"donut","vector_fs":[5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0]},
{"id": "2", "name_s":"apple juice","vector_fs":[1.0,5.0,0.0,0.0,0.0,4.0,4.0,3.0]},
…
]'
https://www.elastic.co/blog/lucene-points-6.0
org.apache.solr.schema.PointField
Multi Valued Field
<fieldType name="pfloats" class="solr.FloatPointField" docValues="true" multiValued="true"/>
Apache Solr - Streaming Expressions
Streaming Expression:

sort(
  select(
    search(food_collection,
           q="*:*",
           fl="id,vector_fs",
           sort="id asc",
           rows=3),
    cosineSimilarity(vector_fs, array(5.1,0.0,1.0,5.0,0.0,4.0,5.0,1.0)) as sim,
    id),
  by="sim desc")
 
Response:
{
  "result-set": {
    "docs": [
        { "sim": 0.99996111, "id": "1" },
        { "sim": 0.98590279, "id": "10" },
        { "sim": 0.55566643, "id": "2" },
        { "EOF": true, "RESPONSE_TIME": 10 }
    ]
  }
}
https://lucene.apache.org/solr/guide/8_7/vector-math.html
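A streaming expression like the one above is sent to the collection's `/stream` handler as an `expr` parameter. The Python sketch below only builds such a request URL for an arbitrary query vector; the helper name `build_stream_url` and the base URL are assumptions for illustration, and nothing is actually sent.

```python
from urllib.parse import urlencode

def build_stream_url(base_url, collection, query_vector, rows=3):
    """Build a /stream request URL that scores documents by cosine similarity."""
    vec = ",".join(str(v) for v in query_vector)
    expr = (
        f'sort(select(search({collection},q="*:*",fl="id,vector_fs",'
        f'sort="id asc",rows={rows}),'
        f'cosineSimilarity(vector_fs,array({vec})) as sim,id),by="sim desc")'
    )
    # urlencode escapes quotes, spaces and parentheses in the expression.
    return f"{base_url}/{collection}/stream?" + urlencode({"expr": expr})

url = build_stream_url(
    "http://localhost:8983/solr",  # assumed local Solr instance
    "food_collection",
    [5.1, 0.0, 1.0, 5.0, 0.0, 4.0, 5.0, 1.0],
)
```

Keeping the expression in one helper makes it easy to swap in the encoded query vector at search time while the rest of the pipeline stays fixed.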
Drawbacks:
1) It doesn't apply to normal search -> you need to use Streaming Expressions
2) Requires traversing all vectors and scoring them
3) No support for multiple vectors per field - SOLR-11077

Query Time
Apache Solr - Solr Vector Search Plugin
<fieldType name="VectorField" class="solr.TextField" indexed="true" termOffsets="true"
stored="true" termPayloads="true" termPositions="true" termVectors="true"
storeOffsetsWithPositions="true">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
</analyzer>
</fieldType>
<field name="vector" type="VectorField" indexed="true" termOffsets="true" stored="true"
termPositions="true" termVectors="true" multiValued="true"/>
Index Time
curl -X POST -H "Content-Type: application/json" "http://localhost:8983/solr/{your-collection-name}/update?commit=true" --data-binary '
[
{"name":"example 0", "vector":"0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33 "},
{"name":"example 1", "vector":"0|3.54 1|0.4 2|4.16 3|4.88 4|4.28 5|4.25 "},
…
]'
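The `vector` field above stores each component as a `position|value` token, which the delimited payload filter splits into a term and a float payload. A small Python sketch of producing that string from a plain list follows; the helper name `to_payload_string` is hypothetical and not part of the plugin.

```python
import json

def to_payload_string(vector):
    """Render a vector as space-separated 'position|value' tokens."""
    return " ".join(f"{i}|{value}" for i, value in enumerate(vector))

doc = {
    "name": "example 0",
    "vector": to_payload_string([1.55, 3.53, 2.3, 0.7, 3.44, 2.33]),
}
# JSON body for the update request: a list with one document.
body = json.dumps([doc])
```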
Apache Solr - Solr Vector Search Plugin
Query Time
http://localhost:8983/solr/{your-collection-name}/query?fl=name,score,vector&q={!vp f=vector vector="0.1,4.75,0.3,1.2,0.7,4.0" cosine=false}
N.B. Adding the parameter cosine=false calculates the dot product
"response":{"numFound":6,"start":0,"maxScore":40.1675,"docs":[
{
"name":["example 3"],
"vector":["0|0.06 1|4.73 2|0.29 3|1.27 4|0.69 5|3.9 "],
"score":40.1675},
{
"name":["example 0"],
"vector":["0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33 "],
"score":30.180502},
…
]}
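The scores in this response can be reproduced by hand, which is a useful sanity check when `cosine=false` is set. The Python sketch below parses the stored payload strings and computes the plain dot product against the query vector; the helper names are illustrative, not the plugin's code.

```python
def parse_payload_string(payload):
    """Turn '0|1.55 1|3.53 ...' back into a list of floats ordered by position."""
    pairs = (token.split("|") for token in payload.split())
    indexed = sorted((int(position), float(value)) for position, value in pairs)
    return [value for _, value in indexed]

def dot_product(a, b):
    """Sum of component-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

query = [0.1, 4.75, 0.3, 1.2, 0.7, 4.0]
example_3 = parse_payload_string("0|0.06 1|4.73 2|0.29 3|1.27 4|0.69 5|3.9 ")
example_0 = parse_payload_string("0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33 ")

score_3 = dot_product(query, example_3)  # approximately 40.1675
score_0 = dot_product(query, example_0)  # approximately 30.1805
```

Both values agree with the `score` fields shown above, up to floating-point rounding.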
Drawbacks:
1) Payloads used for storing vectors -> slow
2) Requires traversing all vectors and scoring them
3) Support for multiple vectors per field must be investigated

N.B. https://github.com/DmitryKey/solr-vector-scoring is a fork with an Apache Solr 8.6 port
Apache Solr - LSH Hashing Plugin
<fieldType name="VectorField" class="solr.BinaryField" stored="true" indexed="false" multiValued="false"/>
<field name="_vector_" type="VectorField" />
<field name="_lsh_hash_" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="vector" type="string" indexed="true" stored="true"/>
Index Time
<updateRequestProcessorChain name="LSH">
<processor class="com.github.saaay71.solr.updateprocessor.LSHUpdateProcessorFactory" >
<int name="seed">5</int>
<int name="buckets">50</int>
<int name="stages">50</int>
<int name="dimensions">6</int>
<str name="field">vector</str>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
curl -X POST -H "Content-Type: application/json" "http://localhost:8983/solr/{your-collection-name}/update?update.chain=LSH&commit=true" --data-binary '
[{"id":"1", "vector":"1.55,3.53,2.3,0.7,3.44,2.33"},
{"id":"2", "vector":"3.54,0.4,4.16,4.88,4.28,4.25"}]'
Apache Solr - LSH Hashing Plugin
Query Time
http://localhost:8983/solr/{your-collection-name}/query?fl=name,score,vector&q={!vp f=vector vector="1.55,3.53,2.3,0.7,3.44,2.33" lsh="true" reRankDocs="5"}&fl=name,score,vector,_vector_,_lsh_hash_
"response":{"numFound":1,"start":0,"maxScore":36.65736,"docs":[
{
"id": "1",
"vector":"1.55,3.53,2.3,0.7,3.44,2.33",
"_vector_":"/z/GZmZAYeuFQBMzMz8zMzNAXCj2QBUeuA==",
"_lsh_hash_":["0_8",
"1_35",
"2_7",
…
"49_43"],
"score":36.65736}
]}
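The `_vector_` value in this response is a Base64 string. Judging only from this one sample, it appears to hold one leading byte followed by six big-endian 32-bit floats; that layout is an inference from the data shown here, not documented plugin behaviour. A Python sketch that decodes it under that assumption:

```python
import base64
import struct

def decode_vector(encoded, dims):
    """Decode a Base64 binary field into floats.

    Assumed layout (inferred from the sample): 1 header byte, then
    `dims` big-endian float32 values.
    """
    raw = base64.b64decode(encoded)
    payload = raw[1:]  # skip the single leading byte
    return list(struct.unpack(f">{dims}f", payload))

decoded = decode_vector("/z/GZmZAYeuFQBMzMz8zMzNAXCj2QBUeuA==", 6)
rounded = [round(value, 2) for value in decoded]
```

The rounded values match the `vector` string of the same document, which suggests the binary field is simply a compact copy of the vector used for re-ranking.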
Drawbacks:
1) Performance must be investigated: vectors are stored encoded in binary fields
2) Latest commit: October 2018

Elasticsearch
Status of Deep Learning Vector Based Search
• Lucene latest codecs and file format not used yet
https://github.com/elastic/elasticsearch/issues/42326 - Work in progress for covering Approximate Nearest Neighbour Techniques

Ready to use Approaches
• X-Pack enterprise features - https://www.elastic.co/guide/en/elasticsearch/reference/current/dens
vector.html
• https://github.com/alexklibisz/elastiknn
• https://github.com/opendistro-for-elasticsearch/k-NN
Limitations
• Performance must be investigated ( https://elastiknn.com/performance/ )
• Re-use existing data structures instead of ad hoc codecs/file formats
• Supports only one vector per field
Elasticsearch - X-Pack
Index Time
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_dense_vector": {
        "type": "dense_vector",
        "dims": 3
      },
      "status": {
        "type": "keyword"
      }
    }
  }
}
PUT my-index-000001/_doc/1
{
  "my_dense_vector": [0.5, 10, 6],
  "status": "published"
}
PUT my-index-000001/_doc/2
{
  "my_dense_vector": [-0.5, 10, 10],
  "status": "published"
}
• N.B. Lucene latest codecs and file format not used yet, vectors are stored as binary doc values.
Elasticsearch - X-Pack
Query Time
N.B. various distance functions are supported
Drawbacks:
1) Requires traversing all vectors returned by the initial query and scoring them
2) No support for multiple vectors per field
GET my-index-000001/_search
{
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": {
            "term": {
              "status": "published"
            }
          }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",
        "params": {
          "query_vector": [4, 3.4, -0.2]
        }
      }
    }
  }
}
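The request works in two phases: the filter selects candidate documents, then the script scores each candidate. Adding 1.0 shifts the cosine from [-1, 1] to [0, 2], since Elasticsearch does not accept negative scores. The Python sketch below mimics that flow locally for the two sample documents; document "3" is a hypothetical extra added only to show the filter at work.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def script_score(query_vector, docs, status="published"):
    """Filter first, then score only the surviving documents."""
    hits = []
    for doc_id, doc in docs.items():
        if doc["status"] != status:
            continue  # filtered out: never scored
        score = cosine_similarity(query_vector, doc["my_dense_vector"]) + 1.0
        hits.append((doc_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

docs = {
    "1": {"my_dense_vector": [0.5, 10, 6], "status": "published"},
    "2": {"my_dense_vector": [-0.5, 10, 10], "status": "published"},
    "3": {"my_dense_vector": [4, 3.4, -0.2], "status": "draft"},  # hypothetical
}
hits = script_score([4, 3.4, -0.2], docs)
```

The draft document would be a perfect match for the query vector, yet it never appears in the hits: the cost and the result set are both bounded by what the initial query returns.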
Next Steps
● Keep an eye on our Blog: https://sease.io/blog, as more is coming!
Question 2
Learning to Rank Libraries
RankLib https://github.com/codelibs/ranklib
XGBoost (University of Washington) https://github.com/dmlc/xgboost
TensorFlow (Google) https://github.com/tensorflow/ranking
LightGBM (Microsoft) https://github.com/Microsoft/LightGBM
CatBoost (Yandex) https://github.com/catboost
SVMRank http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html
LightFM https://github.com/lyst/lightfm
QuickRank (ISTI-CNR) https://github.com/hpclab/quickrank
JForests https://github.com/yasserg/jforests
Ranklib
Overview
https://sourceforge.net/p/lemur/wiki/RankLib/
RankLib is a library of learning to rank algorithms. Currently eight popular algorithms have been
implemented:
• MART (Multiple Additive Regression Trees, a.k.a. Gradient boosted regression tree) [6]
• RankNet [1]
• RankBoost [2]
• AdaRank [3]
• Coordinate Ascent [4]
• LambdaMART [5]
• ListNet [7]
• Random Forests [8]
Ranklib
Our Experience
https://sourceforge.net/p/lemur/wiki/RankLib/
• Multiple learning to rank algorithms supported, including LambdaMART
• Relatively easy to use
• Command Line Interface application -> not meant to be integrated with other apps
• Java code, minimal test coverage
• SVN (there's an unofficial GitHub port: https://github.com/codelibs/ranklib )
• Small Community
XGBoost
Overview
https://github.com/dmlc/xgboost
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient,
flexible and portable.
• It implements machine learning algorithms under the Gradient Boosting framework.
• XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data
science problems in a fast and accurate way.
• The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems
beyond billions of examples.
XGBoost
Our Experience
• Multiple learning to rank algorithms supported, including LambdaMART
• Relatively easy to use
• Library easy to integrate
• Python code, huge project, tests seem fair
• GitHub (https://github.com/dmlc/xgboost )
• Extremely popular
• Huge Community
Learning to Rank Libraries
Limitations:
‣ Developed for small data sets
‣ Limited support for Sparse Features
‣ Require extensive Feature Engineering
‣ Do not support the recent advances in Unbiased Learning-to-rank
The TensorFlow Ranking library addresses these gaps
TensorFlow Ranking
Overview
‣ Open source library for solving large-scale ranking problems in a deep learning framework
‣ Developed by Google’s AI department
‣ Fast and easy to use
‣ Flexible and highly configurable
‣ Supports Pointwise, Pairwise, and Listwise losses
‣ Supports popular ranking metrics like Mean Reciprocal Rank (MRR) and Normalized
Discounted Cumulative Gain (NDCG)
GitHub: https://github.com/tensorflow/ranking
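For readers less familiar with the two metrics named above, here is a plain-Python sketch of how they are commonly defined. It is not the TF-Ranking API; the function names are illustrative, and the NDCG shown uses the exponential gain (2^rel - 1), which is one of several conventions.

```python
import math

def reciprocal_rank(relevances):
    """1 / position of the first relevant result, or 0 if none is relevant."""
    for position, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over a list of per-query relevance lists."""
    return sum(reciprocal_rank(rels) for rels in queries) / len(queries)

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top k results."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 1)
        for position, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances, k):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Each list holds graded relevance labels in the order the ranker returned them.
mrr = mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]])
ndcg = ndcg_at_k([3, 2, 1, 0], k=4)
```

MRR only looks at the first relevant hit, while NDCG rewards placing highly graded documents near the top of the whole list.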
TensorFlow Ranking
Additional components:
‣ Fully integrated with the rest of the TensorFlow ecosystem
‣ Can handle textual features using Text Embeddings
‣ Multi-item (also known as Groupwise) scoring functions
‣ LambdaLoss implementation
‣ Unbiased Learning-to-Rank
TF-Ranking Article: https://arxiv.org/abs/1812.00073
XGBoost vs TensorFlow
Main Differences
XGBoost                   TensorFlow
Tree-based Ranker         Neural Ranker
Handle Missing Values     Handle Missing Values
Run Efficiently on CPU    Run Efficiently on CPU
Large Scale Training      Large Scale Training

Interactive Questions and Answers - London Information Retrieval Meetup

  • 3. General Considerations Language Models https://en.wikipedia.org/wiki/BERT_(language_model) https://en.wikipedia.org/wiki/GPT-3 https://rajpurkar.github.io/SQuAD-explorer/ • Pre Trained on large corpora (expensive) • Ad hoc fine tuning to solve Natural Language Tasks (inexpensive) • Ability to encode terms and sentences as high dimensional vectors e.g. https://github.com/google-research/bert#pre-trained-models https://github.com/hanxiao/bert-as-service/ Bert vectors for sentences [‘access the bank', ‘walking by the street', ‘tigers are big cats'] : [[ 0.13186474 0.32404128 -0.82704437 ... -0.3711958 -0.39250174 -0.31721866] [ 0.24873531 -0.12334424 -0.38933852 ... -0.44756213 -0.5591355 -0.11345179] [ 0.28627345 -0.18580122 -0.30906814 ... -0.2959366 -0.39310536 0.07640187]]
  • 4. General Considerations Language Models in Search • Indexing Time : encode sentences (or full field contents) and store the vectors • Searching Time: encode the query • Score the query-document vectors pair, calculating vector distance/similarity: Euclidean distance Cosine Similarity … Limitations • Rank entire corpus of documents ? Apply an (Approximate) Nearest Neighbour approach? • Performance for embedding extraction? • Un-intuitive results -> should be combined with Traditional Information Retrieval • Explainability
  • 5. Apache Lucene Ideally you want to avoid scoring all documents of your corpus for your query. The algorithms for vector retrieval can be roughly classified into four categories, 1. Tree-base algorithms, such as KD-tree; 2. Hashing methods, such as LSH (Local Sensitive Hashing); 3. Product quantization based algorithms, such as IVFFlat; 4. Graph-base algorithms, such as HNSW, SSG, NSG; Specific File Format (Nov 2020) •https://issues.apache.org/jira/browse/LUCENE-9004 Hierarchical Navigable Small World Graphs - DONE •https://issues.apache.org/jira/browse/LUCENE-9322 DONE Unified Vector Format •https://issues.apache.org/jira/browse/LUCENE-9136 IVFFlat - In Progress
  • 6. Apache Lucene Follow-ups - reducing heap usage during graph construction - adding a Query implementation - exposing index hyper-parameters - benchmarks - testing on public datasets - implementing a diversity heuristic for neighbour selection during graph construction - making the graph hierarchical - exploring more efficient search across multiple per-segment graphs… Keep an eye on Lucene JIRA! https://issues.apache.org/jira/browse/LUCENE-9004
  • 7. Apache Solr Status of Deep Learning Vector Based Search • Lucene latest codecs and file format not used yet https://issues.apache.org/jira/browse/SOLR-14397 -> develop an official solution out of the box https://issues.apache.org/jira/browse/SOLR-12890 -> summary Ready to use Approaches • Vector Scoring using Streaming Expressions (Point Fields) • Available Solr Vector Search Plugin - https://github.com/saaay71/solr-vector-scoring (Payloads) https://medium.com/@dmitry.kan/fun-with-apache-lucene-and-bert-embeddings-c2c496baa559 • Available Solr Vector Search Plugin with LSH Hashing (Payloads) Limitations • Generally slow solutions • Re-use data structures, not using ad hoc codecs/file format • Generally support only one vector per field
  • 8. Apache Solr - Streaming Expressions Index Time <dynamicField name="*_fs" type="pfloats" indexed="true" stored="true"/> Sample Docs: curl -X POST -H "Content-Type: application/json" http://localhost:8983/solr/food_collection/ update?commit=true --data-binary ' [ {"id": "1", "name_s":"donut","vector_fs":[5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0]}, {"id": "2", "name_s":"apple juice","vector_fs":[1.0,5.0,0.0,0.0,0.0,4.0,4.0,3.0]}, … ] https://www.elastic.co/blog/lucene-points-6.0 org.apache.solr.schema.PointField Multi Valued Field <fieldType name="pfloats" class="solr.FloatPointField" docValues="true" multiValued="true"/>
  • 9. Apache Solr - Streaming Expressions Streaming Expression: sort( select( search(food_collection, q="*:*", fl="id,vector_fs", sort="id asc", rows=3), cosineSimilarity(vector_fs, array(5.1,0.0,1.0,5.0,0.0,4.0,5.0,1.0)) as sim, id), by="sim desc")   Response: {   "result-set": {     "docs": [         { "sim": 0.99996111, "id": "1" },         { "sim": 0.98590279, "id": "10" },         { "sim": 0.55566643, "id": "2" },         { "EOF": true, "RESPONSE_TIME": 10 }     ]   } } https://lucene.apache.org/solr/guide/8_7/vector-math.html Drawbacks: 1) it doesn’t apply to normal search -> you need to use Streaming Expressions 2) Requires traversing all vectors and scoring them. 3) no support for multiple vectors per field - SOLR-11077
 Query Time
  • 10. Apache Solr - Solr Vector Search Plugin <fieldType name="VectorField" class="solr.TextField" indexed="true" termOffsets="true" stored="true" termPayloads="true" termPositions="true" termVectors="true" storeOffsetsWithPositions="true"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/> </analyzer> </fieldType> <field name="vector" type="VectorField" indexed="true" termOffsets="true" stored="true" termPositions="true" termVectors="true" multiValued="true"/> Index Time curl -X POST -H "Content-Type: application/json" http://localhost:8983/solr/{your-collection-name}/update? commit=true --data-binary ' [ {"name":"example 0", "vector":"0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33 "}, {"name":"example 1", "vector":"0|3.54 1|0.4 2|4.16 3|4.88 4|4.28 5|4.25 "}, … ]'
  • 11. Apache Solr - Solr Vector Search Plugin Query Time http://localhost:8983/solr/{your-collection-name}/query?fl=name,score,vector&q={!vp f=vector vector="0.1,4.75,0.3,1.2,0.7,4.0" cosine=false} N.B. Adding the parameter cosine=false calculates the dot product "response":{"numFound":6,"start":0,"maxScore":40.1675,"docs":[ { "name":["example 3"], "vector":["0|0.06 1|4.73 2|0.29 3|1.27 4|0.69 5|3.9 "], "score":40.1675}, { "name":["example 0"], "vector":["0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33 "], "score":30.180502}, … ]} Drawbacks: 1) Payloads used for storing vectors->
 slow 2) Requires traversing all vectors and scoring them. 3) support for multiple vectors per field must be investigated
 N.B. https://github.com/DmitryKey/solr-vector-scoring is a fork with a 8.6 Apache Solr port
  • 12. Apache Solr - LSH Hashing Plugin <fieldType name="VectorField" class="solr.BinaryField" stored="true" indexed="false" multiValued="false"/> <field name="_vector_" type="VectorField" /> <field name="_lsh_hash_" type="string" indexed="true" stored="true" multiValued="true"/> <field name="vector" type="string" indexed="true" stored="true"/> Index Time <updateRequestProcessorChain name="LSH"> <processor class="com.github.saaay71.solr.updateprocessor.LSHUpdateProcessorFactory" > <int name="seed">5</int> <int name="buckets">50</int> <int name="stages">50</int> <int name="dimensions">6</int> <str name="field">vector</str> </processor> <processor class="solr.RunUpdateProcessorFactory" /> </updateRequestProcessorChain> curl -X POST -H "Content-Type: application/json" http://localhost:8983/solr/{your-collection- name}/update?update.chain=LSH&commit=true --data-binary ' [{"id":"1", "vector":"1.55,3.53,2.3,0.7,3.44,2.33"}, {"id":"2", "vector":"3.54,0.4,4.16,4.88,4.28,4.25"}]'
  • 13. Apache Solr - LSH Hashing Plugin Query Time http://localhost:8983/solr/{your-collection-name}/query?fl=name,score,vector&q={!vp f=vector vector="1.55,3.53,2.3,0.7,3.44,2.33" lsh="true" reRankDocs="5"} &fl=name,score,vector,_vector_,_lsh_hash_ "response":{"numFound":1,"start":0,"maxScore":36.65736,"docs":[ { "id": "1", "vector":"1.55,3.53,2.3,0.7,3.44,2.33", "_vector_":"/z/GZmZAYeuFQBMzMz8zMzNAXCj2QBUeuA==", "_lsh_hash_":["0_8", "1_35", "2_7", … "49_43"], "score":36.65736} ] Drawbacks: 1) Performance must be investigated, usage of binary fields with encoded vectors
 2) latest commit October 2018

  • 14. Elasticsearch Status of Deep Learning Vector Based Search • Lucene latest codecs and file format not used yet https://github.com/elastic/elasticsearch/issues/42326 - Work in progress for covering Approximate Nearest Neighbour Techiques Ready to use Approaches • X-Pack enterprise features - https://www.elastic.co/guide/en/elasticsearch/reference/current/dens vector.html • https://github.com/alexklibisz/elastiknn • https://github.com/opendistro-for-elasticsearch/k-NN Limitations • Performance must be investigated ( https://elastiknn.com/performance/ ) • Re-use data structures, not using ad hoc codecs/file format • Supports only one vector per field
  • 15. Elasticsearch - X-Pack Index Time PUT my-index-000001 { "mappings": { "properties": { "my_dense_vector": { "type": "dense_vector", "dims": 3 }, "status" : { "type" : "keyword" } } } } PUT my-index-000001/_doc/1 { "my_dense_vector": [0.5, 10, 6], "status" : "published" } PUT my-index-000001/_doc/2 { "my_dense_vector": [-0.5, 10, 10], "status" : "published" } • N.B. Lucene latest codecs and file format not used yet, vectors are stored as binary doc values.
  • 16. Elasticsearch - X-Pack Query Time N.B. various distance functions are supported Drawbacks: 1) Requires traversing and scoring every vector matched by the initial query. 2) No support for multiple vectors per field
 
 GET my-index-000001/_search { "query": { "script_score": { "query" : { "bool" : { "filter" : { "term" : { "status" : "published" } } } }, "script": { "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0", "params": { "query_vector": [4, 3.4, -0.2] } } } } }
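As a rough illustration of what the `script_score` query computes, here is a plain-Python sketch that applies the same formula (cosine similarity plus 1.0) to the two documents indexed earlier. It mimics the scoring arithmetic only; the in-memory `docs` dictionary and the helper name are hypothetical and no Elasticsearch API is involved.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Stand-in for the index: the two documents from the index-time slide.
docs = {
    "1": {"my_dense_vector": [0.5, 10, 6], "status": "published"},
    "2": {"my_dense_vector": [-0.5, 10, 10], "status": "published"},
}
query_vector = [4, 3.4, -0.2]

# Filter first, then score every remaining vector (no nearest-neighbour index).
scores = {
    doc_id: cosine_similarity(query_vector, doc["my_dense_vector"]) + 1.0
    for doc_id, doc in docs.items()
    if doc["status"] == "published"
}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

The `+ 1.0` shifts the cosine range from [-1, 1] to [0, 2], which keeps scores non-negative. The loop over every filtered document is the brute-force cost described in the first drawback.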
  • 17. Next Steps ● Keep an eye on our Blog: https://sease.io/blog, as more is coming!
  • 19. Learning to Rank Libraries RankLib https://github.com/codelibs/ranklib XGBoost (University of Washington) https://github.com/dmlc/xgboost TensorFlow (Google) https://github.com/tensorflow/ranking LightGBM (Microsoft) https://github.com/Microsoft/LightGBM CatBoost (Yandex) https://github.com/catboost SVMRank http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html LightFM https://github.com/lyst/lightfm QuickRank (ISTI-CNR) https://github.com/hpclab/quickrank JForests https://github.com/yasserg/jforests
  • 20. Ranklib Overview https://sourceforge.net/p/lemur/wiki/RankLib/ RankLib is a library of learning to rank algorithms. Currently eight popular algorithms have been implemented: • MART (Multiple Additive Regression Trees, a.k.a. Gradient boosted regression tree) [6] • RankNet [1] • RankBoost [2] • AdaRank [3] • Coordinate Ascent [4] • LambdaMART [5] • ListNet [7] • Random Forests [8]
  • 21. Ranklib Our Experience https://sourceforge.net/p/lemur/wiki/RankLib/ • Multiple learning to rank algorithms supported including LambdaMART • Relatively easy to use • Command Line Interface application -> not meant to be integrated with other apps • Java code, minimal Test Coverage • SVN (there’s an unofficial GitHub port: https://github.com/codelibs/ranklib ) • Small Community
  • 22. XGBoost Overview https://github.com/dmlc/xgboost XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. • It implements machine learning algorithms under the Gradient Boosting framework. • XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. • The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.
  • 23. XGBoost Our Experience • Multiple learning to rank algorithms supported including LambdaMART • Relatively easy to use • Library easy to integrate • Python code, huge project, Tests seem fair • GitHub (https://github.com/dmlc/xgboost ) • Extremely popular • Huge Community
  • 24. Learning to Rank Libraries Limitations: ‣ Developed for small data sets ‣ Limited support for Sparse Features ‣ Require extensive Feature Engineering ‣ Do not support the recent advances in Unbiased Learning-to-rank The TensorFlow Ranking library addresses these gaps
  • 25. TensorFlow Ranking Overview ‣ Open source library for solving large-scale ranking problems in a deep learning framework ‣ Developed by Google’s AI department ‣ Fast and easy to use ‣ Flexible and highly configurable ‣ Supports Pointwise, Pairwise, and Listwise losses ‣ Supports popular ranking metrics like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) GitHub: https://github.com/tensorflow/ranking
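For readers unfamiliar with the two metrics named above, here is a minimal plain-Python sketch of MRR and NDCG. It does not use TensorFlow Ranking's own metric classes; it assumes graded relevance labels listed in ranked order and uses the common exponential-gain form of DCG, which is one of several variants in use.

```python
import math


def reciprocal_rank(relevances):
    """1 / position of the first relevant result, or 0.0 if none is relevant."""
    for position, relevance in enumerate(relevances, start=1):
        if relevance > 0:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings):
    """Average reciprocal rank over several queries."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)


def dcg(relevances, k):
    """Discounted cumulative gain at cut-off k (exponential gain variant)."""
    return sum(
        (2 ** relevance - 1) / math.log2(position + 1)
        for position, relevance in enumerate(relevances[:k], start=1)
    )


def ndcg(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Each inner list holds the relevance labels of one query's results, in ranked order.
print(mean_reciprocal_rank([[1, 0], [0, 1]]))
print(ndcg([3, 2, 1, 0], k=4), ndcg([0, 1, 2, 3], k=4))
```

MRR only looks at the first relevant hit, while NDCG rewards placing highly relevant documents near the top of the whole list, which is why the two are often reported together.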
  • 26. TensorFlow Ranking Additional components: ‣ Fully integrated with the rest of the TensorFlow ecosystem ‣ Can handle textual features using Text Embeddings ‣ Multi-item (also known as Groupwise) scoring functions ‣ LambdaLoss implementation ‣ Unbiased Learning-to-Rank TF-Ranking Article: https://arxiv.org/abs/1812.00073
  • 27. XGBoost vs TensorFlow - Main Differences (XGBoost | TensorFlow): Tree-based Ranker | Neural Ranker; Handle Missing Values | Handle Missing Values; Run Efficiently on CPU | Run Efficiently on CPU; Large Scale Training | Large Scale Training