Interactive Questions and Answers - London Information Retrieval Meetup

Interactive Q&A
10th December 2020

General Considerations
Language Models
https://en.wikipedia.org/wiki/BERT_(language_model)
https://en.wikipedia.org/wiki/GPT-3
https://rajpurkar.github.io/SQuAD-explorer/
• Pre-trained on large corpora (expensive)
• Ad hoc fine-tuning to solve Natural Language tasks (inexpensive)
• Ability to encode terms and sentences as high-dimensional vectors
e.g.
https://github.com/google-research/bert#pre-trained-models
https://github.com/hanxiao/bert-as-service/
BERT vectors for the sentences ['access the bank', 'walking by the street', 'tigers are big cats']:
[[ 0.13186474 0.32404128 -0.82704437 ... -0.3711958 -0.39250174 -0.31721866]
 [ 0.24873531 -0.12334424 -0.38933852 ... -0.44756213 -0.5591355 -0.11345179]
 [ 0.28627345 -0.18580122 -0.30906814 ... -0.2959366 -0.39310536 0.07640187]]

General Considerations
Language Models in Search
• Indexing time: encode sentences (or full field contents) and store the vectors
• Search time: encode the query
• Score each query-document vector pair by calculating a vector distance/similarity:
  - Euclidean distance
  - Cosine similarity
  - …
Limitations
• Rank the entire corpus of documents? Or apply an (Approximate) Nearest Neighbour approach?
• Performance of embedding extraction?
• Unintuitive results -> should be combined with traditional Information Retrieval
• Explainability
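
As a minimal sketch of the scoring step above, the following pure-Python snippet ranks stored document vectors against a query vector by brute force. The function names (cosine_similarity, euclidean_distance, rank_by_cosine) and the toy three-dimensional vectors are illustrative assumptions, not part of any library; real embeddings have hundreds of dimensions. Scoring every document like this is the cost that the first limitation above refers to.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_cosine(query, docs):
    """Score every document and return (doc_id, score) pairs, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy "index": document id -> stored vector.
index = {
    "d1": [1.0, 0.0, 0.0],
    "d2": [0.7, 0.7, 0.0],
    "d3": [0.0, 0.0, 1.0],
}
ranking = rank_by_cosine([1.0, 0.1, 0.0], index)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```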

Apache Lucene
Ideally you want to avoid scoring all documents of your corpus for your query.
The algorithms for vector retrieval can be roughly classified into four categories:
1. Tree-based algorithms, such as KD-tree;
2. Hashing methods, such as LSH (Locality-Sensitive Hashing);
3. Product quantization based algorithms, such as IVFFlat;
4. Graph-based algorithms, such as HNSW, SSG, NSG.
Specific File Format (Nov 2020)
• https://issues.apache.org/jira/browse/LUCENE-9004
  - Hierarchical Navigable Small World Graphs - DONE
  - Unified Vector Format - DONE
  - IVFFlat - In Progress
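
To make the hashing category above more concrete, here is a minimal sketch of one well-known LSH family: random hyperplanes for cosine similarity. It is a generic illustration, not the Lucene implementation and not the scheme used by any plugin in these slides; all names (make_hyperplanes, lsh_signature, hamming) and parameters are assumptions. Vectors pointing in similar directions tend to share most signature bits, so candidates can be looked up by signature instead of scoring the whole corpus.

```python
import random

def make_hyperplanes(num_bits, dims, seed=5):
    """Draw num_bits random hyperplanes (normal vectors) in dims dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(num_bits)]

def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )

def hamming(sig_a, sig_b):
    """Number of differing bits; a small value suggests a small angle."""
    return sum(x != y for x, y in zip(sig_a, sig_b))

planes = make_hyperplanes(num_bits=16, dims=3)
sig_a = lsh_signature([1.0, 2.0, 3.0], planes)
sig_scaled = lsh_signature([2.0, 4.0, 6.0], planes)       # same direction as a
sig_opposite = lsh_signature([-1.0, -2.0, -3.0], planes)  # opposite direction

print(hamming(sig_a, sig_scaled), hamming(sig_a, sig_opposite))
```

The signature depends only on direction, so a vector and its positive multiple collide, while opposite vectors disagree on every hyperplane.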

Apache Lucene
Follow-ups
- reducing heap usage during graph construction
- adding a Query implementation
- exposing index hyper-parameters
- benchmarks
- testing on public datasets
- implementing a diversity heuristic for neighbour selection during graph construction
- making the graph hierarchical
- exploring more efficient search across multiple per-segment graphs…
Keep an eye on Lucene JIRA!
https://issues.apache.org/jira/browse/LUCENE-9004

Apache Solr
Status of Deep Learning Vector Based Search
• Lucene latest codecs and file format not used yet
https://issues.apache.org/jira/browse/SOLR-14397 -> develop an official solution out of the box
https://issues.apache.org/jira/browse/SOLR-12890 -> summary
Ready to use Approaches
• Vector Scoring using Streaming Expressions (Point Fields)
• Available Solr Vector Search Plugin - https://github.com/saaay71/solr-vector-scoring (Payloads)
https://medium.com/@dmitry.kan/fun-with-apache-lucene-and-bert-embeddings-c2c496baa559
• Available Solr Vector Search Plugin with LSH Hashing (Payloads)
Limitations
• Generally slow solutions
• Re-use existing data structures instead of ad hoc codecs/file formats
• Generally support only one vector per field

Apache Solr - Streaming Expressions
Index Time
<dynamicField name="*_fs" type="pfloats" indexed="true" stored="true"/>
Sample Docs:

curl -X POST -H "Content-Type: application/json" "http://localhost:8983/solr/food_collection/update?commit=true" --data-binary '
[
{"id": "1", "name_s":"donut","vector_fs":[5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0]},
{"id": "2", "name_s":"apple juice","vector_fs":[1.0,5.0,0.0,0.0,0.0,4.0,4.0,3.0]},
…
]'
Multi-valued point field (org.apache.solr.schema.PointField, see https://www.elastic.co/blog/lucene-points-6.0):
<fieldType name="pfloats" class="solr.FloatPointField" docValues="true" multiValued="true"/>

Apache Solr - Streaming Expressions
Query Time
Streaming Expression:

sort(
  select(
    search(food_collection,
           q="*:*",
           fl="id,vector_fs",
           sort="id asc",
           rows=3),
    cosineSimilarity(vector_fs, array(5.1,0.0,1.0,5.0,0.0,4.0,5.0,1.0)) as sim,
    id),
  by="sim desc")

Response:
{
  "result-set": {
    "docs": [
      { "sim": 0.99996111, "id": "1" },
      { "sim": 0.98590279, "id": "10" },
      { "sim": 0.55566643, "id": "2" },
      { "EOF": true, "RESPONSE_TIME": 10 }
    ]
  }
}
Reference: https://lucene.apache.org/solr/guide/8_7/vector-math.html

Drawbacks:
1) It doesn't apply to normal search -> you need to use Streaming Expressions
2) Requires traversing all vectors and scoring them
3) No support for multiple vectors per field - SOLR-11077
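
As a sanity check on the response above, the "sim" values for documents 1 and 2 can be recomputed from the sample vectors shown at index time. This is a small sketch in plain Python, not Solr code; the helper name cosine_similarity is an assumption. Document 10 is omitted because its vector is not shown in the slides.

```python
import math

def cosine_similarity(a, b):
    """dot(a, b) / (|a| * |b|), the quantity the expression stores as 'sim'."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Query vector from the array(...) argument of the streaming expression.
query = [5.1, 0.0, 1.0, 5.0, 0.0, 4.0, 5.0, 1.0]

# Sample documents from the index-time slide.
docs = {
    "1": [5.0, 0.0, 1.0, 5.0, 0.0, 4.0, 5.0, 1.0],  # donut
    "2": [1.0, 5.0, 0.0, 0.0, 0.0, 4.0, 4.0, 3.0],  # apple juice
}

sims = {doc_id: cosine_similarity(vec, query) for doc_id, vec in docs.items()}
for doc_id, sim in sorted(sims.items(), key=lambda pair: pair[1], reverse=True):
    print(doc_id, sim)
```

The printed values should agree with the "sim" fields in the response to about six decimal places.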

Apache Solr - Solr Vector Search Plugin
<fieldType name="VectorField" class="solr.TextField" indexed="true" termOffsets="true"
stored="true" termPayloads="true" termPositions="true" termVectors="true"
storeOffsetsWithPositions="true">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
</analyzer>
</fieldType>
<field name="vector" type="VectorField" indexed="true" termOffsets="true" stored="true"
termPositions="true" termVectors="true" multiValued="true"/>
Index Time
curl -X POST -H "Content-Type: application/json" "http://localhost:8983/solr/{your-collection-name}/update?commit=true" --data-binary '
[
{"name":"example 0", "vector":"0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33 "},
{"name":"example 1", "vector":"0|3.54 1|0.4 2|4.16 3|4.88 4|4.28 5|4.25 "},
…
]'
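
The "vector" field above uses a "position|value" token format, which the delimited payload filter splits into a term (the position) and a float payload (the value). A minimal sketch of producing that string from a list of floats follows; the helper name to_payload_string is hypothetical and not part of the plugin.

```python
import json

def to_payload_string(vector):
    """Render [1.55, 3.53] as '0|1.55 1|3.53': one whitespace-separated
    'position|value' token per dimension."""
    return " ".join(f"{position}|{value}" for position, value in enumerate(vector))

doc = {
    "name": "example 0",
    "vector": to_payload_string([1.55, 3.53, 2.3, 0.7, 3.44, 2.33]),
}

# JSON body suitable for the --data-binary argument of the update request.
body = json.dumps([doc])
print(body)
```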

Apache Solr - Solr Vector Search Plugin
Query Time
http://localhost:8983/solr/{your-collection-name}/query?fl=name,score,vector&q={!vp f=vector vector="0.1,4.75,0.3,1.2,0.7,4.0" cosine=false}
N.B. With cosine=false the score is the dot product rather than the cosine similarity.
"response":{"numFound":6,"start":0,"maxScore":40.1675,"docs":[
{
"name":["example 3"],
"vector":["0|0.06 1|4.73 2|0.29 3|1.27 4|0.69 5|3.9 "],
"score":40.1675},
{
"name":["example 0"],
"vector":["0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33 "],
"score":30.180502},
…
]}
Drawbacks:
1) Payloads are used for storing vectors -> slow
2) Requires traversing all vectors and scoring them
3) Support for multiple vectors per field must be investigated
N.B. https://github.com/DmitryKey/solr-vector-scoring is a fork with an Apache Solr 8.6 port
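
The scores in the response above can be checked by hand: with cosine=false each score is the dot product of the query vector and the stored vector. The sketch below parses the payload strings from the response and recomputes both scores; the helper names are assumptions, and this is plain Python rather than plugin code.

```python
def parse_payload_string(text):
    """Turn '0|0.06 1|4.73 ...' back into a list of floats ordered by position."""
    pairs = (token.split("|") for token in text.split())
    return [float(value) for _, value in sorted((int(pos), value) for pos, value in pairs)]

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

query = [0.1, 4.75, 0.3, 1.2, 0.7, 4.0]
example_3 = parse_payload_string("0|0.06 1|4.73 2|0.29 3|1.27 4|0.69 5|3.9 ")
example_0 = parse_payload_string("0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33 ")

score_3 = dot_product(query, example_3)
score_0 = dot_product(query, example_0)
print(score_3, score_0)
```

Both values should match the "score" fields in the response up to single-precision rounding.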

Apache Solr - LSH Hashing Plugin
<fieldType name="VectorField" class="solr.BinaryField" stored="true" indexed="false" multiValued="false"/>
<field name="_vector_" type="VectorField" />
<field name="_lsh_hash_" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="vector" type="string" indexed="true" stored="true"/>
Index Time
<updateRequestProcessorChain name="LSH">
<processor class="com.github.saaay71.solr.updateprocessor.LSHUpdateProcessorFactory" >
<int name="seed">5</int>
<int name="buckets">50</int>
<int name="stages">50</int>
<int name="dimensions">6</int>
<str name="field">vector</str>
</processor>
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
curl -X POST -H "Content-Type: application/json" "http://localhost:8983/solr/{your-collection-name}/update?update.chain=LSH&commit=true" --data-binary '
[{"id":"1", "vector":"1.55,3.53,2.3,0.7,3.44,2.33"},
{"id":"2", "vector":"3.54,0.4,4.16,4.88,4.28,4.25"}]'

Apache Solr - LSH Hashing Plugin
Query Time
http://localhost:8983/solr/{your-collection-name}/query?fl=name,score,vector&q={!vp f=vector vector="1.55,3.53,2.3,0.7,3.44,2.33" lsh="true" reRankDocs="5"}&fl=name,score,vector,_vector_,_lsh_hash_
"response":{"numFound":1,"start":0,"maxScore":36.65736,"docs":[
{
"id": "1",
"vector":"1.55,3.53,2.3,0.7,3.44,2.33",
"_vector_":"/z/GZmZAYeuFQBMzMz8zMzNAXCj2QBUeuA==",
"_lsh_hash_":["0_8",
"1_35",
"2_7",
…
"49_43"],
"score":36.65736}
]}
Drawbacks:
1) Performance must be investigated: binary fields with encoded vectors are used
2) Latest commit: October 2018
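
The "_vector_" value in the response above is the stored vector in encoded form. Decoding the sample suggests a layout of one leading byte followed by big-endian 32-bit floats; this layout is inferred from this single example, not taken from the plugin documentation, so treat it as an assumption. The function name decode_vector_field is hypothetical.

```python
import base64
import struct

def decode_vector_field(encoded, dims):
    """Decode a Base64 '_vector_' value into a list of floats.

    Assumed layout (inferred from the sample only): 1 leading byte,
    then `dims` big-endian IEEE 754 single-precision floats.
    """
    raw = base64.b64decode(encoded)
    if len(raw) != 1 + 4 * dims:
        raise ValueError(f"unexpected length {len(raw)} for {dims} dimensions")
    return list(struct.unpack(f">{dims}f", raw[1:]))

decoded = decode_vector_field("/z/GZmZAYeuFQBMzMz8zMzNAXCj2QBUeuA==", dims=6)
rounded = [round(value, 2) for value in decoded]
print(rounded)
```

Rounded to two decimals, the decoded values correspond to the "vector" string of the same document.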

Elasticsearch
Status of Deep Learning Vector Based Search
• Lucene latest codecs and file format not used yet
https://github.com/elastic/elasticsearch/issues/42326 - Work in progress for covering Approximate Nearest Neighbour Techniques

Ready to use Approaches
• X-Pack enterprise features - https://www.elastic.co/guide/en/elasticsearch/reference/current/dens
vector.html
• https://github.com/alexklibisz/elastiknn
• https://github.com/opendistro-for-elasticsearch/k-NN
Limitations
• Performance must be investigated ( https://elastiknn.com/performance/ )
• Re-use existing data structures instead of ad hoc codecs/file formats
• Supports only one vector per field

Elasticsearch - X-Pack
Index Time
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_dense_vector": {
        "type": "dense_vector",
        "dims": 3
      },
      "status" : {
        "type" : "keyword"
      }
    }
  }
}
PUT my-index-000001/_doc/1
{
  "my_dense_vector": [0.5, 10, 6],
  "status" : "published"
}
PUT my-index-000001/_doc/2
{
  "my_dense_vector": [-0.5, 10, 10],
  "status" : "published"
}
• N.B. Lucene latest codecs and file format not used yet, vectors are stored as binary doc values.

Elasticsearch - X-Pack
Query Time
GET my-index-000001/_search
{
  "query": {
    "script_score": {
      "query" : {
        "bool" : {
          "filter" : {
            "term" : {
              "status" : "published"
            }
          }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",
        "params": {
          "query_vector": [4, 3.4, -0.2]
        }
      }
    }
  }
}
N.B. various distance functions are supported
Drawbacks:
1) Requires traversing all vectors returned by the initial query and scoring them
2) No support for multiple vectors per field
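
To illustrate what the script above computes, the sketch below applies the same formula, cosine similarity plus 1.0, to the two sample documents in plain Python. The "+ 1.0" shifts the cosine range from [-1, 1] to [0, 2], which matters because Elasticsearch does not accept negative scores. The function names are assumptions and this is not Painless code.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def script_score(query_vector, doc_vector):
    """Mirror of: cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0"""
    return cosine_similarity(query_vector, doc_vector) + 1.0

query_vector = [4, 3.4, -0.2]
docs = {
    "1": [0.5, 10, 6],
    "2": [-0.5, 10, 10],
}

scores = {doc_id: script_score(query_vector, vec) for doc_id, vec in docs.items()}
for doc_id, score in sorted(scores.items(), key=lambda pair: pair[1], reverse=True):
    print(doc_id, round(score, 4))
```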

Next Steps
● Keep an eye on our Blog: https://sease.io/blog, as more is coming!

Learning to Rank Libraries
RankLib https://github.com/codelibs/ranklib
XGBoost (University of Washington) https://github.com/dmlc/xgboost
TensorFlow (Google) https://github.com/tensorflow/ranking
LightGBM (Microsoft) https://github.com/Microsoft/LightGBM
CatBoost (Yandex) https://github.com/catboost
SVMRank http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html
LightFM https://github.com/lyst/lightfm
QuickRank (ISTI-CNR) https://github.com/hpclab/quickrank
JForests https://github.com/yasserg/jforests

RankLib
Overview
https://sourceforge.net/p/lemur/wiki/RankLib/
RankLib is a library of learning to rank algorithms. Currently eight popular algorithms have been
implemented:
• MART (Multiple Additive Regression Trees, a.k.a. Gradient boosted regression tree) [6]
• RankNet [1]
• RankBoost [2]
• AdaRank [3]
• Coordinate Ascent [4]
• LambdaMART [5]
• ListNet [7]
• Random Forests [8]

RankLib
Our Experience
https://sourceforge.net/p/lemur/wiki/RankLib/
• Multiple learning to rank algorithms supported, including LambdaMART
• Relatively easy to use
• Command Line Interface application -> not meant to be integrated with other apps
• Java code, minimal test coverage
• SVN (there is an unofficial GitHub port: https://github.com/codelibs/ranklib )
• Small Community
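
Because RankLib is driven from the command line, training data is handed over as text files in the LETOR/SVMlight-style format: a relevance label, a query id, numbered feature values and an optional trailing comment. The sketch below writes and reads such lines; the helper names are hypothetical and the feature values are made up for illustration.

```python
def to_letor_line(label, query_id, features, comment=None):
    """Format one training example as '<label> qid:<id> 1:<f1> 2:<f2> ... # <comment>'."""
    parts = [str(label), f"qid:{query_id}"]
    parts += [f"{index}:{value}" for index, value in enumerate(features, start=1)]
    line = " ".join(parts)
    return f"{line} # {comment}" if comment else line

def parse_letor_line(line):
    """Inverse of to_letor_line: returns (label, query_id, {index: value}, comment)."""
    body, _, comment = line.partition("#")
    tokens = body.split()
    label = int(tokens[0])
    query_id = tokens[1].split(":", 1)[1]
    features = {}
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, query_id, features, comment.strip()

line = to_letor_line(3, "q1", [0.5, 1.0], comment="doc42")
print(line)
print(parse_letor_line(line))
```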

XGBoost
Overview
https://github.com/dmlc/xgboost
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient,
flexible and portable.
• It implements machine learning algorithms under the Gradient Boosting framework.
• XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data
science problems in a fast and accurate way.
• The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems
beyond billions of examples.

XGBoost
Our Experience
• Multiple learning to rank objectives supported, including LambdaMART
• Relatively easy to use
• Library easy to integrate
• Python code, huge project, Tests seem fair
• GitHub (https://github.com/dmlc/xgboost)
• Extremely popular
• Huge Community

Learning to Rank Libraries
Limitations:
‣ Developed for small data sets
‣ Limited support for Sparse Features
‣ Require extensive Feature Engineering
‣ Do not support the recent advances in Unbiased Learning-to-rank
The TensorFlow Ranking library addresses these gaps

TensorFlow Ranking
Overview
‣ Open source library for solving large-scale ranking problems in a deep learning framework
‣ Developed by Google’s AI department
‣ Fast and easy to use
‣ Flexible and highly configurable
‣ Supports Pointwise, Pairwise, and Listwise losses
‣ Supports popular ranking metrics like Mean Reciprocal Rank (MRR) and Normalized
Discounted Cumulative Gain (NDCG)
GitHub: https://github.com/tensorflow/ranking
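
For readers unfamiliar with the two metrics named above, here is a minimal pure-Python sketch of MRR and NDCG. It is independent of TensorFlow Ranking and only illustrates the definitions; the gain 2^rel - 1 with a log2 discount is one common convention, and all function names are assumptions. Each input list holds the relevance labels of the results in ranked order.

```python
import math

def reciprocal_rank(relevances):
    """1 / position of the first relevant result, or 0.0 if there is none."""
    for position, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(ranked_lists):
    """Average reciprocal rank over several queries."""
    return sum(reciprocal_rank(r) for r in ranked_lists) / len(ranked_lists)

def dcg(relevances, k=None):
    """Discounted cumulative gain with gain 2^rel - 1 and a log2 discount."""
    top = relevances if k is None else relevances[:k]
    return sum((2 ** rel - 1) / math.log2(pos + 1) for pos, rel in enumerate(top, start=1))

def ndcg(relevances, k=None):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))
print(ndcg([3, 2, 1, 0]), ndcg([0, 1, 2, 3]))
```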

TensorFlow Ranking
Additional components:
‣ Fully integrated with the rest of the TensorFlow ecosystem
‣ Can handle textual features using Text Embeddings
‣ Multi-item (also known as Groupwise) scoring functions
‣ LambdaLoss implementation
‣ Unbiased Learning-to-Rank
TF-Ranking Article: https://arxiv.org/abs/1812.00073

XGBoost vs TensorFlow
Main Differences
XGBoost                   TensorFlow
Tree-based Ranker         Neural Ranker
Handle Missing Values     Handle Missing Values
Run Efficiently on CPU    Run Efficiently on CPU
Large Scale Training      Large Scale Training
