Custom spellchecker for SOLR

Custom SpellSuggestor
Murthy Remella

Little history
• Spellchecker setup
• Default + word break dictionaries in collate mode
• Information from multiple required text fields is stored in a single
text field and used as input
• Observations
• Spellchecker was moved to a different request handler

Examples
• Default spell checker limitations
• First suggestion is not always the best
‣ backpacks for gurls
‣ Singulars/Plurals issue
‣ 15c calculator > 15c calculators
‣ Incorrect suggestions
‣ 6d tv > 6d HDTV
• Opportunities
• max paine
• fischer price

What are we looking to add?
• Historical user behavior
❖ What are the similar searches?
❖ How did the user correct the query to fetch better results?
❖ How frequently did this happen?
• Context
❖ Is the user looking for a toy or a movie?
❖ Is this a brand name?

Custom spellchecker
• How does it work? A set of algorithms captures the user input,
processes it, and generates a set of candidate suggestions, each with a
confidence score (0 to 1)
Query validator → Query edits → Score calculator → Confidence generator → Suggestions

Performance inputs
• Peter Norvig’s edit generation vs Faroo’s
• Our algorithm considers the frequencies of all bigram combinations in the query. It generates these
combinations and evaluates them on the fly.
• To keep performance predictable as the number of input tokens varies, we use a performance budget rather
than a simple token count. In a query such as "How to train a dragon", only the tokens "train" and "dragon"
generate alternatives, so it is a two-token query from a performance point of view.
• The budget is currently set at 50k combinations, which reduces the trigger rate by 0.6% of the actual
trigger rate.
• The rationale for limiting the number of tokens is that leaving a token uncorrected hurts search results
less in a long query than in a short one.
Average processing time
Peter Norvig’s    5.3 ms
Faroo’s           1.8 ms
Improvement       2.9x
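
The performance budget described above can be sketched as follows. This is only an illustration, assuming the number of combinations is the product of the per-token alternative counts; the candidate lists and the names `combination_count` and `within_budget` are hypothetical and not taken from the actual implementation. Only the 50k figure comes from the slide.

```python
from math import prod

BUDGET = 50_000  # combination budget quoted on the slide


def combination_count(candidates_per_token):
    """Number of candidate queries: product of per-token alternative counts.

    A token with no alternatives still contributes a factor of 1.
    """
    return prod(max(1, len(c)) for c in candidates_per_token)


def within_budget(candidates_per_token, budget=BUDGET):
    """True when evaluating every combination stays inside the budget."""
    return combination_count(candidates_per_token) <= budget


# Hypothetical alternatives for "how to train a dragon": only "train" and
# "dragon" produce more than one candidate, so the cost is driven by them.
candidates = [
    ["how"],
    ["to"],
    ["train", "trail", "rain"],
    ["a"],
    ["dragon", "dragoon"],
]

print(combination_count(candidates))  # 1 * 1 * 3 * 1 * 2 = 6
print(within_budget(candidates))      # True
```

Counting combinations this way, rather than tokens, lets long queries made mostly of unambiguous words through while still capping the worst case.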

Precision Metrics
• Precision metrics generated for the top 1 million searches over a
year
• Conversion and $ benefits (I can’t show the data here :)
Threshold            Queries corrected   Precision
0.9+ multi token     12000               99%
0.9+ single token    4975                94%
0.87-0.9             22190               81%

NDCG Measurements
Threshold   Type of dataset        Score
0.9+        default spellchecker   0.34
0.9+        custom spellchecker    0.52
0.87-0.9    default spellchecker   0.38
0.87-0.9    custom spellchecker    0.46
Normalized discounted cumulative gain (NDCG) measures the performance of a recommendation system based on the
graded relevance of the recommended entities. It varies from 0.0 to 1.0, with 1.0 representing the ideal ranking of the
entities. This metric is commonly used in information retrieval and to evaluate the performance of web search engines.
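
As a minimal sketch of how an NDCG score like those above might be computed: the function names are illustrative, and the linear-gain form of DCG is an assumption, since the slides do not say which variant was used (some evaluations use a gain of 2^rel - 1 instead).

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Graded relevance of results in the order they were returned (made-up values).
print(ndcg([3, 2, 1]))  # already ideally ordered, so 1.0
print(ndcg([1, 2, 3]))  # best result ranked last, so below 1.0
```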

What is cooking?
• Improving trigger rate
• Trigger rate is around 1-3% in our current model for confidence
scores of 0.87+
• Null searches are around 8% (approx.)
• Considering a noisy channel model for the probability of a spelling modification
• Opportunities in identifying the relation between the tokens to be
used for spell suggestion

Questions?
If you have any questions or would like to bounce
ideas around, please write to
murthy.remella@target.com

How is this different?
• Find the most likely spelling correction for the word
• Ex: given "lates", it could be "late" or "latest"
• Among all possible corrections, find the correction c that maximizes the
probability of c given the original word w
• argmax_c P(c|w)
• argmax_c P(c) P(w|c) (simplified: by Bayes' rule, dropping the constant P(w))
• argmax_c language model * error model
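
The argmax above can be illustrated with a small sketch in the style of Peter Norvig's well-known spelling corrector. The word counts are invented stand-ins for a query-log language model, and the error model is assumed to be uniform over single-edit candidates, so the ranking here is driven entirely by P(c). None of this is the production scoring.

```python
from collections import Counter

# Hypothetical frequencies standing in for the language model P(c).
WORD_COUNTS = Counter({"late": 120, "latest": 300, "plates": 40})
TOTAL = sum(WORD_COUNTS.values())
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def language_model(c):
    """P(c): relative frequency of the candidate."""
    return WORD_COUNTS[c] / TOTAL


def error_model(w, c):
    """P(w|c): assumed uniform for every single-edit candidate."""
    return 1.0


def correct(w):
    """Return argmax over known candidates of P(c) * P(w|c), or w itself."""
    if WORD_COUNTS[w] > 0:
        return w  # already a known word
    candidates = [c for c in edits1(w) if WORD_COUNTS[c] > 0]
    if not candidates:
        return w  # nothing known within one edit
    return max(candidates, key=lambda c: language_model(c) * error_model(w, c))


print(correct("lates"))  # "latest" beats "late" and "plates" on frequency
```

A non-uniform error model, for example one that weights deletions differently from insertions, would change which candidate wins without changing the structure of the argmax.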
