A comparison of approaches to searching large bodies of human-readable content: traditional full text search and LLM-powered "AI search", how each works under the hood, and how to combine the best of both worlds.
3. AGENDA
Full Text Search
Vector Search
Tips & Tricks
Hybrid Search
4. FULL TEXT INDEX
lexically ordered
list of “pages” containing the word
“terms” not words
“stop words” not included
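The index structure described on this slide can be sketched as follows. This is a minimal illustration, not a production indexer: the stop word list, the crude plural-stripping "stemmer", and the names `to_terms` and `build_index` are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Illustrative stop word list (an assumption; real lists are language-specific).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def to_terms(text):
    """Turn raw text into index terms: lowercase, drop stop words, strip a plural 's'."""
    terms = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOP_WORDS:
            continue
        # Crude stand-in for a real stemmer: "cats" -> "cat".
        if word.endswith("s") and len(word) > 3:
            word = word[:-1]
        terms.append(word)
    return terms


def build_index(pages):
    """Map each term to the sorted list of page ids containing it, terms in lexical order."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in to_terms(text):
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


pages = {
    1: "The cats sleep in the sun",
    2: "A cat chases the dogs",
    3: "Dogs sleep",
}
index = build_index(pages)
for term, page_ids in index.items():
    print(term, page_ids)
```

Because "cats" and "cat" reduce to the same term, a query for either finds pages 1 and 2, which is the point of indexing terms rather than words.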
5. PROBLEMS WITH FULL TEXT SEARCH
1. Stemming is language-specific: detect the language first (in .NET use NTextCat)
2. Typographic errors: use n-gram similarity
3. Different words, same meaning
4. Text only
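To make point 2 concrete, here is a small sketch of trigram similarity, one common form of n-gram similarity. The padding scheme (two leading spaces, one trailing) and the Jaccard ratio are assumptions chosen for the example; the function names are hypothetical.

```python
def trigrams(word):
    """Return the set of 3-character substrings of the padded, lowercased word."""
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def ngram_similarity(a, b):
    """Jaccard similarity of the two trigram sets: 1.0 for identical words, 0.0 for disjoint."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


# A transposition typo still shares trigrams with the intended word,
# while an unrelated word shares none.
print(ngram_similarity("search", "serach"))
print(ngram_similarity("search", "banana"))
```

A search engine can use such a score to suggest the closest indexed term when a query term has no exact match.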
14. EUCLIDEAN DISTANCE
• Values: [0, +Inf) (smaller is better)
• Fast
• Anomaly and fraud detection
For vectors a and b with n dimensions:
$$d(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \dots + (a_n - b_n)^2}$$
15. DOT PRODUCT SIMILARITY
• Value: (-Inf, +Inf) (bigger is better)
• Fast
• Image retrieval and matching
• Music recommendation
For vectors a and b separated by angle θ:
$$a \cdot b = |a|\,|b| \cos\theta$$
16. COSINE SIMILARITY
• Value: [-1.0, 1.0] (bigger is better)
• Slower than dot product (divides by both vector lengths)
• Text document similarity
• Recommendation systems
[Figure: vectors a and b separated by angle θ]
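The three measures from slides 14 to 16 can be compared side by side in a few lines. This is a plain-Python sketch for illustration (function names are hypothetical); real systems would use vectorized or database-native implementations.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points; 0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot_product(a, b):
    """Sum of pairwise products; grows with both alignment and vector length."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by both vector lengths, leaving only the angle."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


a = [1.0, 0.0]
b = [3.0, 4.0]
print(euclidean_distance(a, b))
print(dot_product(a, b))
print(cosine_similarity(a, b))
```

For vectors that are already normalized to unit length, the dot product equals the cosine similarity, so the cheaper dot product can be used without changing the ranking.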
28. HNSW IN SQLITE
.load ./vec0
CREATE VIRTUAL TABLE document_embeddings USING vec0(
    embedding FLOAT[768]
);
-- query: vec0 KNN queries need a LIMIT (or k = ?) constraint; 10 is illustrative
SELECT
    rowid,
    distance
FROM document_embeddings
WHERE embedding MATCH '[0.83443, 0.15224, …]'
ORDER BY distance
LIMIT 10;
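29. HNSW IN POSTGRESQL
CREATE EXTENSION vector;
CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    embedding VECTOR(768)
);
-- create index
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
-- query: <=> is cosine distance (1 - cosine similarity), so smaller is better
-- a column alias cannot be used in WHERE, so the expression is repeated there
SELECT
    id,
    embedding <=> '[0.83443, 0.15224, …]' AS distance
FROM documents
WHERE embedding <=> '[0.83443, 0.15224, …]' < 0.3  -- i.e. cosine similarity > 0.7
ORDER BY distance  -- ascending order lets the HNSW index be used
LIMIT 10;          -- illustrative result size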
34. MULTI VECTOR INDEX
1. Cut content into paragraphs
2. Group paragraphs into blocks by max allowed size (e.g. 8000 chars).
35. MULTI VECTOR INDEX
1. Cut content into paragraphs
2. Group paragraphs into blocks by max allowed size (e.g. 8000 chars).
3. Use content address hashing to identify blocks
0xae12e7
0x3902a1
0xef7312
0x06cd01
36. MULTI VECTOR INDEX
1. Cut content into paragraphs
2. Group paragraphs into blocks by max allowed size (e.g. 8000 chars).
3. Use content address hashing to identify blocks
4. Reindex a block only when it has changed sufficiently (e.g. >15%).
0xae12e7
0x3902a1
0x43bf01
0x06cd01
37. MULTI VECTOR INDEX
1. Cut content into paragraphs
2. Group paragraphs into blocks by max allowed size (e.g. 8000 chars).
3. Use content address hashing to identify blocks
4. Reindex a block only when it has changed sufficiently (e.g. >15%).
5. If a block drops below the min allowed size (e.g. 4000 chars), stitch it to the smallest adjacent block
0xae12e7
0x3902a1
0x9e721c
38. MULTI VECTOR INDEX
1. Cut content into paragraphs
2. Group paragraphs into blocks by max allowed size (e.g. 8000 chars).
3. Use content address hashing to identify blocks
4. Reindex a block only when it has changed sufficiently (e.g. >15%).
5. If a block drops below the min allowed size (e.g. 4000 chars), stitch it to the smallest adjacent block
6. If a block exceeds the max allowed size, split it by paragraphs into two halves
0xae12e7
0x3902a1
0x12e757
0x49bcd1
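The six steps above can be sketched roughly as follows. Several details are assumptions made for this illustration rather than something the slides specify: paragraphs are separated by blank lines, the block id is a truncated SHA-256 hex digest (mirroring the short ids shown on the slides), the amount of change is measured with `difflib.SequenceMatcher`, and all function names are hypothetical.

```python
import difflib
import hashlib

MAX_SIZE = 8000            # max block size in characters (slide example)
MIN_SIZE = 4000            # min block size in characters (slide example)
REINDEX_THRESHOLD = 0.15   # fraction of change that triggers re-embedding


def block_text(block):
    return "\n\n".join(block)


def block_size(block):
    return len(block_text(block))


def block_hash(block):
    """Step 3: content address; the id changes only when the block text changes."""
    return hashlib.sha256(block_text(block).encode("utf-8")).hexdigest()[:6]


def group_paragraphs(content, max_size=MAX_SIZE):
    """Steps 1-2: cut content into paragraphs and pack them greedily into blocks."""
    paragraphs = [p.strip() for p in content.split("\n\n") if p.strip()]
    blocks, current = [], []
    for paragraph in paragraphs:
        if current and block_size(current + [paragraph]) > max_size:
            blocks.append(current)
            current = []
        current.append(paragraph)
    if current:
        blocks.append(current)
    return blocks


def needs_reindex(old_block, new_block, threshold=REINDEX_THRESHOLD):
    """Step 4: re-embed only when more than `threshold` of the text changed."""
    matcher = difflib.SequenceMatcher(
        None, block_text(old_block), block_text(new_block), autojunk=False
    )
    return 1.0 - matcher.ratio() > threshold


def rebalance(blocks, min_size=MIN_SIZE, max_size=MAX_SIZE):
    """Steps 5-6: stitch undersized blocks to the smallest neighbour, split oversized ones."""
    blocks = [list(b) for b in blocks]
    i = 0
    while i < len(blocks) and len(blocks) > 1:
        if block_size(blocks[i]) < min_size:
            neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(blocks)]
            j = min(neighbours, key=lambda n: block_size(blocks[n]))
            lo, hi = min(i, j), max(i, j)
            blocks[lo:hi + 1] = [blocks[lo] + blocks[hi]]
            i = lo  # re-check the merged block
        else:
            i += 1
    result = []
    for block in blocks:
        if block_size(block) > max_size and len(block) > 1:
            middle = len(block) // 2
            result.extend([block[:middle], block[middle:]])
        else:
            result.append(block)
    return result
```

After an edit, only blocks whose hash changed need to be looked at, and of those only the ones for which `needs_reindex` returns true need a new embedding, which keeps embedding costs proportional to the size of the change rather than the size of the document.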
44. RECIPROCAL RANK FUSION
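The query is sent to both full text search and vector search; each returns its own ranking:

Rank  Full text search  Vector search
1     A                 C
2     B                 B
3     C                 F
4     D                 A
5     E                 D

Each document's fused score (a document missing from a list contributes 0 for that list):
$$\mathrm{score}(d) = \frac{1}{1 + \mathrm{rank}_{FTS}(d)} + \frac{1}{1 + \mathrm{rank}_{VS}(d)}$$

Fused ranking:

Score  Document
0.75   C
0.70   A
0.67   B
0.37   D
0.25   F
0.17   E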
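The fusion on this slide can be reproduced with a short sketch. The function name and the `k` parameter are choices made for this example; `k=1` matches the slide's formula, while larger constants such as 60 are commonly used in practice to flatten the difference between top ranks.

```python
def reciprocal_rank_fusion(rankings, k=1):
    """Sum 1 / (k + rank) over every ranking a document appears in; ranks start at 1."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


full_text = ["A", "B", "C", "D", "E"]
vector = ["C", "B", "F", "A", "D"]
fused = reciprocal_rank_fusion([full_text, vector])
for doc, score in fused:
    print(f"{score:.2f} {doc}")
```

Because only ranks are used, the two searches do not need comparable score scales, which is what makes this approach convenient for combining full text and vector results.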