Activate Conference - 2019
Custom Tanimoto & Cosine Similarity Vector Operator
VECTOR OPERATORS FOR SOLR
Accenture at-a-glance
– 459,000 employees
– 16 major industry segments
– Operating in 52 countries
– Global Delivery Network
– $41bn revenue, 10.3% growth in local currency
– $1bn+ in V&A
– $1.7bn investment in R&D and training
– 100% of leading industry analyst reports rank us in the leader category
– 3K+ data scientists
– 1,600+ patents and patents pending
– 100+ alliance partners
– 20+ innovation centers
That’s Applied Intelligence
– 20K+ professionals
– 20+ years of advanced analytics experience
– 6K+ deep AI experts
– 250+ apps and solutions
Search & Content Analytics
– Started 2005
– 200 employees
– 800+ customers
– 21 Search Engine Architects
– 154 Search Engine Engineers
– Focus areas: Unstructured Content Search; Natural Language; High-Volume, Ad Hoc Analytics
– Locations: Washington DC (HQ); San Diego; Bracknell, UK; Frankfurt, DE; Prague, CZ; Manila, PH; San Jose, CR
– Near Shore Development Center: 20+ Deep Search and Big Data Experts
– Near Shore Development Center: 60+ Deep Search and Big Data Experts
[Chart: Number of projects by Search Engine Technology]
Why Create a Vector Operator?
We can think of four reasons…
– You have vectors to search
– “The only good score is a percentage score”
– Your documents have intimate relations with Neural Network embedding vectors
– You want to precompute all your term weights, because you don’t trust anyone else to do it right
For SOC O*NET Codes…
Example Vector
income tax advisor:0.22662242;tax advisor:0.22662242;tax consultant:0.22662242;
licensed tax consultant:0.22662242;tax evaluator:0.22662242;
income tax consultant:0.22662242;certified income tax preparer:0.22662242;
income tax expert:0.22662242;enrolled agent:0.22662242;
income tax preparer:0.22662242;corporate tax preparer:0.22662242;
master tax advisor:0.22662242;tax form:0.22317137;tax specialist:0.20433334;
CTP:0.20433334;tax preparer:0.20433334;tax manager:0.19126709;
tax returns:0.18835443;tax associate:0.18198828;interview clients:0.13498485;
data input:0.11331121;state tax:0.11331121;income statements:0.11331121;
answer questions:0.10770458;tax planning:0.10216667;detect errors:0.09099414;
deductions:0.08743438;data entry:0.07979348;financial information:0.04895036
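The example vector above is a semicolon-separated list of `term:weight` pairs. As a minimal sketch of reading that format (the class name `VectorStringParser` is hypothetical and not part of the talk's code), a parser might look like this:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parses "term:weight;term:weight;..." into an ordered map. */
class VectorStringParser {
    static Map<String, Float> parse(String vector) {
        Map<String, Float> weights = new LinkedHashMap<>();
        for (String entry : vector.split(";")) {
            entry = entry.trim();
            if (entry.isEmpty()) continue; // tolerate a trailing ';' or line breaks
            // Split on the LAST colon so the term text may itself contain ':'
            int sep = entry.lastIndexOf(':');
            if (sep < 0) throw new IllegalArgumentException("Missing weight: " + entry);
            String term = entry.substring(0, sep).trim();
            float weight = Float.parseFloat(entry.substring(sep + 1).trim());
            weights.put(term, weight);
        }
        return weights;
    }

    public static void main(String[] args) {
        System.out.println(parse("tax advisor:0.22662242;tax form:0.22317137;deductions:0.08743438"));
    }
}
```

Multi-word terms such as "income tax advisor" stay intact because only `;` and the last `:` are treated as delimiters.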
Step 1: Indexing Vectors
Create a custom analyzer
How to index weight values?
– Option 1: use position data (easiest, most efficient)
– Option 2: use payload data
But when using position data, tokens must be emitted in order of weight (lowest to highest), because Lucene position increments cannot be negative.
Vector: “hello:0.5;world:0.4;love:0.3” → Analyzer → Index
Text    Weight
hello   0.5
world   0.4
love    0.3
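To make the position trick concrete, here is a small sketch in plain Java with no Lucene dependency. It assumes the same weight-to-position scaling the talk's analyzer uses (weight × 1000, rounded); the names `PositionEncoder` and `Token` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: turns term weights into ascending position increments. */
class PositionEncoder {
    /** One token as it would be handed to the index. */
    static final class Token {
        final String text;
        final int positionIncrement;
        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /** 0.5 -> 500: the weight rounded to the nearest 1/1000. */
    static int toPosition(double weight) {
        return (int) (weight * 1000.0 + 0.5);
    }

    static List<Token> encode(Map<String, Double> vector) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(vector.entrySet());
        entries.sort(Map.Entry.comparingByValue()); // lowest weight first
        List<Token> tokens = new ArrayList<>();
        int current = 0;
        for (Map.Entry<String, Double> e : entries) {
            int position = toPosition(e.getValue());
            // Ascending order keeps every increment >= 0
            tokens.add(new Token(e.getKey(), Math.max(0, position - current)));
            current = position;
        }
        return tokens;
    }
}
```

For the vector `hello:0.5;world:0.4;love:0.3` this emits `love` (+300), `world` (+100), `hello` (+100). Terms with equal weights get an increment of 0 and so share a position.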
Step 1: Indexing Vectors
Code sample
@Override
public final boolean incrementToken() throws IOException {
    .
    . // READ TOKEN TEXT:
    .
    while ((c = this.input.read()) != -1 && vectorStringSize < MAX_VECTOR_CHARS)
        vectorString[vectorStringSize++] = (char) c;
    .
    . // EXTRACT THE NEXT TOKEN
    .
    termAtt.setEmpty();
    // Loop through characters in vectorString[] to fetch characters of each token
    termAtt.append(tokenChar);
    termAtt.setLength(numTokenChars); // trim trailing whitespace
    .
    . // EXTRACT THE WEIGHT
    .
    // Loop through characters in vectorString[] to fetch characters of the weight
    // Convert to weight value
    .
    . // OUTPUT THE NEW TOKEN
    .
    // Encode the weight as a position: 0.5 -> 500 (rounded to the nearest 1/1000)
    int newPosition = (int) (weight * 1000.0 + 0.5);
    posIncrAtt.setPositionIncrement(minZero(newPosition - curPosition));
    curPosition = newPosition;
    sumOfSquares += weight * weight; // running sum of squared weights
    return true;
}
DEMO!
Step 2: Vector Term Operator
Create a custom term operator
Query: hello:0.5 → VectorTermQuery → VectorTermWeight → VectorTermScorer
– Lucene Query: represents the query, independent of the index
– Lucene Weight: computes values that are constant for all documents, e.g. IDF (uses index stats)
– Lucene Scorer: matches documents and computes scores, document by document (uses index postings)
Step 2: Vector Term Operator
Code Sample for Query & Weight
public class VectorTermQuery extends Query implements NormalizedBoostType {
    public VectorTermQuery(SpanQuery query, float outputBoost) { . . . }

    @Override
    public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode,
                               float boost) throws IOException {
        // Build the inner span weight with a neutral boost of 1
        final SpanWeight tmp_innerWeight =
            (SpanWeight) searcher.createWeight(query, ScoreMode.COMPLETE, 1f);
        return new VectorTermWeight(this, tmp_innerWeight, this.outputBoost);
    }
}

public class VectorTermWeight extends Weight implements NormalizedBoostType {
    protected VectorTermWeight(Query query, SpanWeight innerWeight,
                               float outputBoost) { . . . }

    @Override
    public Scorer scorer(LeafReaderContext context) throws IOException {
        ScorerSupplier innerScorerSupplier = innerWeight.scorerSupplier(context);
        if (innerScorerSupplier == null) return null; // no matches in this segment
        final SpanScorer innerScorer =
            (SpanScorer) innerScorerSupplier.get(Long.MAX_VALUE);
        return new VectorTermScorer(innerWeight, innerScorer, outputBoost);
    }
}
Step 2: Vector Term Operator
Code Sample for Scorer
public final class VectorTermScorer extends Scorer implements NormalizedBoostType {
    public VectorTermScorer(Weight weight, SpanScorer innerScorer,
                            float outputBoost) { . . . }

    // Matching is delegated entirely to the inner span scorer
    @Override
    public TwoPhaseIterator twoPhaseIterator() {
        return innerScorer.twoPhaseIterator();
    }

    @Override
    public DocIdSetIterator iterator() {
        return innerScorer.iterator();
    }

    @Override
    public int docID() {
        return innerScorer.docID();
    }

    @Override
    public float score() throws IOException {
        Spans spans = innerScorer.getSpans();
        // The indexed position encodes the document weight (weight * 1000)
        int startPos = spans.nextStartPosition() + 1;
        // Score = query weight * document weight
        return outputBoost * startPos / 1000.0F;
    }

    @Override
    public float getBoost() {
        return outputBoost;
    }
}
DEMO!
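The `+1` in `score()` deserves a note. Lucene assigns the first token of a field the position "first increment minus one", so a weight indexed with an increment of 500 is read back as position 499. A minimal, Lucene-free sketch of the decoding step (the class name `TermScoreDecoder` is hypothetical):

```java
/** Hypothetical sketch of how a stored position becomes a term score. */
class TermScoreDecoder {
    /**
     * @param startPosition position read from the index (encoded weight minus one)
     * @param queryWeight   weight of this term in the query vector
     * @return queryWeight * documentWeight, one summand of the dot product
     */
    static float score(int startPosition, float queryWeight) {
        int encodedWeight = startPosition + 1;        // undo Lucene's zero-based positions
        return queryWeight * encodedWeight / 1000.0F; // undo the * 1000 scaling
    }

    public static void main(String[] args) {
        // Document weight 0.5 is indexed as position 499; query weight is 0.5
        System.out.println(score(499, 0.5F));
    }
}
```

Because weights are rounded to the nearest 1/1000 at index time, the decoded document weight carries at most three decimal places of precision.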
Step 3: The Vector Operator
Multiple vector terms = A vector!

Term         Query weight   Document weight (from INDEX)
george       0.5            0.10
martha       0.7            0.65
washington   0.1            (not in document)
custis       0.75           0.90

Document magnitude: $\|D\| = 1.3$

Tanimoto similarity:
$$\text{score} = \frac{Q \cdot D}{\|Q\|^2 + \|D\|^2 - Q \cdot D}$$

Query magnitude: $\|Q\| = \sqrt{0.5^2 + 0.7^2 + 0.1^2 + 0.75^2} = 1.146$
$Q \cdot D = 0.5 \times 0.1 + 0.7 \times 0.65 + 0.75 \times 0.9 = 1.18$
Final score $= 1.18 / (1.146^2 + 1.3^2 - 1.18) = 0.647$
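The arithmetic on this slide can be checked with a few lines of plain Java. This is only a sketch: the class name is hypothetical, and the zero weight for "washington" is an assumption inferred from that term being absent from the slide's dot product.

```java
/** Hypothetical check of the slide's Tanimoto example. */
class TanimotoWorkedExample {
    static double dot(double[] q, double[] d) {
        double sum = 0.0;
        for (int i = 0; i < q.length; i++) sum += q[i] * d[i];
        return sum;
    }

    static double magnitude(double[] v) {
        return Math.sqrt(dot(v, v));
    }

    static double tanimoto(double dotProduct, double qMag, double dMag) {
        return dotProduct / (qMag * qMag + dMag * dMag - dotProduct);
    }

    public static void main(String[] args) {
        // Order: george, martha, washington, custis
        double[] q = {0.5, 0.7, 0.1, 0.75};
        double[] d = {0.10, 0.65, 0.0, 0.90}; // assumption: washington is not in the document
        double dMag = 1.3;                    // taken from the slide; covers ALL document terms
        double qd = dot(q, d);
        double qMag = magnitude(q);
        System.out.printf("Q.D=%.2f |Q|=%.3f score=%.3f%n", qd, qMag, tanimoto(qd, qMag, dMag));
    }
}
```

After rounding, this reproduces the slide's values of 1.18, 1.146 and 0.647. The document magnitude is passed in separately because it depends on every term in the document, not only those that match the query.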
Lucene “Pulls” documents from Scorers
[Diagram: a tree of OR, AND, TERM(george), TERM(washington) and TERM(president) scorers above the INDEX; documents are pulled up through the tree with next()]
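The pull model can be illustrated without Lucene. The sketch below is a simplified stand-in, not Lucene's actual classes: the `DocIterator` interface, the posting lists, and the tree shape OR(AND(george, washington), president) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class PullModelSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Minimal stand-in for a scorer's iterator: the caller pulls the next match. */
    interface DocIterator {
        int docID();
        int nextDoc();
    }

    /** Leaf: walks one term's sorted posting list. */
    static class Term implements DocIterator {
        private final int[] postings;
        private int idx = -1;
        Term(int... postings) { this.postings = postings; }
        public int docID() {
            if (idx < 0) return -1;
            return idx < postings.length ? postings[idx] : NO_MORE_DOCS;
        }
        public int nextDoc() {
            if (idx < postings.length) idx++;
            return docID();
        }
    }

    /** AND: leapfrogs its two children until both sit on the same document. */
    static class And implements DocIterator {
        private final DocIterator a, b;
        private int doc = -1;
        And(DocIterator a, DocIterator b) { this.a = a; this.b = b; }
        public int docID() { return doc; }
        public int nextDoc() {
            int da = a.nextDoc();
            int db = b.docID();
            while (da != db) {
                if (da < db) da = a.nextDoc(); else db = b.nextDoc();
            }
            return doc = da;
        }
    }

    /** OR: pulls every child at or behind the current document, returns the smallest. */
    static class Or implements DocIterator {
        private final DocIterator[] subs;
        private int doc = -1;
        Or(DocIterator... subs) { this.subs = subs; }
        public int docID() { return doc; }
        public int nextDoc() {
            int min = NO_MORE_DOCS;
            for (DocIterator s : subs) {
                if (s.docID() <= doc) s.nextDoc();
                min = Math.min(min, s.docID());
            }
            return doc = min;
        }
    }

    static List<Integer> collect(DocIterator it) {
        List<Integer> docs = new ArrayList<>();
        for (int d = it.nextDoc(); d != NO_MORE_DOCS; d = it.nextDoc()) docs.add(d);
        return docs;
    }

    public static void main(String[] args) {
        DocIterator query = new Or(
            new And(new Term(1, 3, 5, 8), new Term(3, 4, 8, 9)), // george AND washington
            new Term(2, 8, 10));                                 // president
        System.out.println(collect(query)); // [2, 3, 8, 10]
    }
}
```

Nothing is computed until the top-level caller asks for the next document. Each operator then pulls only as far as it needs from its children, which is the behaviour a custom scorer has to respect.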
Step 3: Vector Operator
Code Sample for Vector Query
public final class VectorQuery extends Query implements Iterable<Query> {
    public VectorQuery(Collection<Query> queryClauses) {
        .
        . // Compute the magnitude of the query vector
        .
        double sumOfSquares = 0.0;
        for (Query q : queryClauses) {
            // Query has no getBoost(); each clause exposes its weight via NormalizedBoostType
            float clauseBoost = ((NormalizedBoostType) q).getBoost();
            sumOfSquares += clauseBoost * clauseBoost;
        }
        this.queryVectorMagnitude = Math.sqrt(sumOfSquares);
    }

    @Override
    public Weight createWeight(IndexSearcher searcher,
                               ScoreMode scoreMode, float boost) throws IOException {
        // One Weight per vector term, plus one for the stored document magnitude
        ArrayList<Weight> weightClauses = new ArrayList<Weight>(queryClauses.length);
        for (Query queryClause : queryClauses) {
            weightClauses.add(searcher.createWeight(queryClause, scoreMode, boost));
        }
        Weight magnitudeWeight =
            searcher.createWeight(magnitudeClause, scoreMode, boost);
        return new VectorWeight(this, weightClauses, magnitudeWeight, searcher,
                                scoreMode, similarityFunc, queryVectorMagnitude);
    }
}
Step 3: Vector Operator
Code Sample for VectorWeight
public class VectorWeight extends Weight {
    // Note: the call sites also pass similarityFunc, which this abbreviated signature omits
    public VectorWeight(Query query, Collection<Weight> weightClauses,
                        Weight magnitudeWeight, IndexSearcher searcher, ScoreMode scoreMode,
                        double queryVectorMagnitude)
            throws IOException { . . . }

    @Override
    public Scorer scorer(LeafReaderContext context) throws IOException {
        List<Scorer> scorers = new ArrayList<>();
        Scorer magnitudeScorer = magnitudeWeight.scorer(context);
        for (Weight w : weightClauses) {
            Scorer subScorer = w.scorer(context);
            if (subScorer != null)
                scorers.add(subScorer);
        }
        if (scorers.isEmpty() || magnitudeScorer == null) {
            // no sub-scorers had any documents
            return null;
        }
        else {
            return new VectorScorer(this, scorers, magnitudeScorer, scoreMode,
                                    similarityFunc, queryVectorMagnitude);
        }
    }
}
Step 3: Vector Operator
Code Sample for VectorScorer
final class VectorScorer extends DisjunctionScorer {
    VectorScorer(Weight weight, List<Scorer> subScorers, Scorer magnitudeScorer,
                 ScoreMode scoreMode, double vectorQueryMagnitude)
            throws IOException { . . . }

    @Override
    protected float score(DisiWrapper topList) throws IOException {
        double dotProduct = 0.0;
        // Note that sub-scorer is already boosted by query vector weight
        int docId = -1;
        for (DisiWrapper w = topList; w != null; w = w.next) {
            dotProduct += w.scorer.score();
            if (docId < 0)
                docId = w.scorer.docID();
        }
        // Move the magnitude scorer forward to the document being scored
        if (this.magnitudeScorer.docID() < docId)
            this.magnitudeIterator.advance(docId);
        double documentMagnitude = magnitudeScorer.score();
        // Tanimoto: Q.D / (|Q|^2 + |D|^2 - Q.D)
        return (float) (dotProduct /
            (vectorQueryMagnitude * vectorQueryMagnitude +
             documentMagnitude * documentMagnitude - dotProduct));
    }
}
DEMO!
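The scorer shown computes Tanimoto similarity only, while the talk's title also mentions cosine. The slides do not show how `similarityFunc` is defined, so the enum below is purely a hypothetical sketch of how the two formulas could sit behind one interface. It uses the standard definitions of both measures.

```java
/** Hypothetical similarity selector; not the talk's actual similarityFunc type. */
enum VectorSimilarity {
    TANIMOTO {
        @Override
        double score(double dot, double qMag, double dMag) {
            double denom = qMag * qMag + dMag * dMag - dot;
            return denom == 0.0 ? 0.0 : dot / denom; // guard against two zero vectors
        }
    },
    COSINE {
        @Override
        double score(double dot, double qMag, double dMag) {
            double denom = qMag * dMag;
            return denom == 0.0 ? 0.0 : dot / denom; // guard against a zero vector
        }
    };

    /**
     * @param dot  dot product of query and document vectors
     * @param qMag magnitude of the query vector
     * @param dMag magnitude of the document vector
     */
    abstract double score(double dot, double qMag, double dMag);
}
```

With the earlier example values (Q·D = 1.18, |Q| ≈ 1.146, |D| = 1.3), Tanimoto gives about 0.647 and cosine about 0.792. Both are 1.0 for identical vectors, which is what makes them readable as percentage-style scores.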
Step 4: Solr XML Query Operator
So we can use our new operators in Solr!
<VectorQuery fieldName="vector">
  <term weight="0.22662242">income tax advisor</term>
  <term weight="0.22662242">tax advisor</term>
  <term weight="0.22662242">tax consultant</term>
  <term weight="0.22662242">licensed tax consultant</term>
  <term weight="0.22662242">tax evaluator</term>
  <term weight="0.22662242">income tax consultant</term>
  <term weight="0.22662242">certified income tax preparer</term>
  <term weight="0.22662242">income tax expert</term>
  <term weight="0.22662242">enrolled agent</term>
  <term weight="0.22662242">income tax preparer</term>
  <term weight="0.22662242">corporate tax preparer</term>
  <term weight="0.22662242">master tax advisor</term>
  <term weight="0.22317137">tax form</term>
  <term weight="0.20433334">tax specialist</term>
  <term weight="0.20433334">CTP</term>
  <term weight="0.20433334">tax preparer</term>
  <term weight="0.19126709">tax manager</term>
  <term weight="0.18835443">tax returns</term>
  <term weight="0.18198828">tax associate</term>
  <term weight="0.13498485">interview clients</term>
  <term weight="0.11331121">data input</term>
  <term weight="0.11331121">state tax</term>
  <term weight="0.11331121">income statements</term>
  <term weight="0.10770458">answer questions</term>
  <term weight="0.10216667">tax planning</term>
  <term weight="0.09099414">detect errors</term>
  <term weight="0.08743438">deductions</term>
  <term weight="0.07979348">data entry</term>
  <term weight="0.04895036">financial information</term>
</VectorQuery>
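Writing such a query by hand is tedious, so a client would normally generate it from the `term:weight;...` vector string. A minimal sketch follows; the class name `VectorQueryXml` and its methods are hypothetical, and the element and attribute names are copied from the XML above.

```java
/** Hypothetical helper: renders a "term:weight;..." string as <VectorQuery> XML. */
class VectorQueryXml {
    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    static String fromVectorString(String fieldName, String vector) {
        StringBuilder xml = new StringBuilder();
        xml.append("<VectorQuery fieldName=\"").append(escape(fieldName)).append("\">\n");
        for (String entry : vector.split(";")) {
            int sep = entry.lastIndexOf(':');
            if (sep < 0) continue; // skip blank or malformed entries
            String term = entry.substring(0, sep).trim();
            String weight = entry.substring(sep + 1).trim();
            Float.parseFloat(weight); // validate; throws NumberFormatException if not numeric
            xml.append("  <term weight=\"").append(weight).append("\">")
               .append(escape(term)).append("</term>\n");
        }
        xml.append("</VectorQuery>");
        return xml.toString();
    }

    public static void main(String[] args) {
        System.out.println(fromVectorString("vector", "tax form:0.22317137;deductions:0.08743438"));
    }
}
```

The weight is kept as the original string rather than re-formatted from a float, so the query carries exactly the digits of the source vector.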
Step 4: Solr XML Query Operator
Code Sample for VectorQueryBuilder
public class VectorQueryBuilder extends SolrQueryBuilder {
    public VectorQueryBuilder(String defaultField, Analyzer analyzer,
                              SolrQueryRequest req, QueryBuilder queryFactory) { . . . }

    @Override
    public Query getQuery(Element e) throws ParserException {
        float outputBoost = DOMUtils.getAttribute(e, "weight", 1.0F);
        String fieldName = DOMUtils.getAttributeOrFail(e, "fieldName");
        ArrayList<Query> clauses = new ArrayList<Query>();
        NodeList nl = e.getChildNodes();
        final int nlLen = nl.getLength();
        for (int i = 0; i < nlLen; i++) {
            Node node = nl.item(i);
            // Each <term weight="..."> child becomes one VectorTermQuery clause
            if (node.getNodeName().equals("term")) {
                Element termElem = (Element) node;
                float termWeight = DOMUtils.getAttribute(termElem, "weight", 0.5F);
                String termS = DOMUtils.getNonBlankTextOrFail(termElem);
                Query termQ = new VectorTermQuery(
                    new SpanTermQuery(
                        new Term(fieldName, termS.trim())),
                    termWeight);
                clauses.add(termQ);
            }
        }
        Query q = new VectorQuery(clauses, outputBoost);
        return q;
    }
}
Step 4: Solr XML Query Operator
Copy the JAR into your Solr lib directory
Step 4: Solr XML Query Operator
Update solrconfig.xml
<config>
  .
  .
  .
  <lib dir="./lib" />
  .
  .
  .
  <queryParser name="xmlparser" class="XmlQParserPlugin">
    <str name="VectorQuery">com.accenture.sca.search.lucene.vector.VectorQueryBuilder</str>
  </queryParser>
  .
  .
  .
</config>
DEMO!
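Once the parser is registered under the name `xmlparser`, a request can select it with Solr's `defType` parameter and pass the XML as `q`. The sketch below only builds the request URL; the base URL and the collection name `jobs` are placeholders, not values from the talk.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: builds a /select URL that routes q through the XML parser. */
class VectorQueryRequest {
    static String selectUrl(String solrBaseUrl, String collection, String xmlQuery) {
        // defType must match the name given to <queryParser> in solrconfig.xml
        return solrBaseUrl + "/" + collection + "/select?defType=xmlparser&q="
            + URLEncoder.encode(xmlQuery, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String xml = "<VectorQuery fieldName=\"vector\">"
            + "<term weight=\"0.5\">hello</term></VectorQuery>";
        // Placeholder host and collection; adjust for a real installation
        System.out.println(selectUrl("http://localhost:8983/solr", "jobs", xml));
    }
}
```

URL-encoding matters here because the XML contains `<`, `>`, `"` and spaces, none of which are safe in a query string. For large vectors, sending the query in a POST body may be preferable to a long URL.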
Custom Operators are Fun!
Some Final Thoughts
It’s time to break the tyranny of TF-IDF / BM25
– They are 40+ years old!!
It’s easier than you think
– Honest!
It opens up a brave new world
– Come to the Accenture booth for my research paper on creating a new framework for Rational & Comparable search engine operators
THANK YOU!
Paul Nelson
Innovation Lead
paul.e.nelson@accenture.com


Creating a Custom Tanimoto or Cosine Similarity Vector Operator for Lucene / Solr
