Automatically Build Solr Synonyms List using Machine Learning - Chao Han, Lucidworks

Automatically Build Solr Synonym List
Using Machine Learning
Chao Han
VP, Head of Data Science, Lucidworks

Goal
• Automatically generate a Solr synonym list that includes synonyms, common
misspellings, and misplaced blank spaces. Choose the right Solr synonym format
for each entry (e.g., one-directional or bi-directional).
• Examples:
• Synonym: bag, case; four, iv; mac, apple mac, mac book, macbook
• Acronym: playstation, ps
• Misspelling: accesory, accesoire, accessoire, accessorei => accessory
• Misplaced blank spaces: book end, bookend; whirl pool => whirlpool
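
As a rough illustration of the two formats, the sketch below renders some of the examples above as lines for a Solr synonyms file. A comma-separated line is treated as a set of equivalent terms (bi-directional), while `=>` maps the left-hand terms onto the right-hand term only (one-directional). The helper names `bidirectional` and `one_directional` are made up for this example and are not part of Solr or Fusion.

```python
def bidirectional(terms):
    """Equivalent terms: a comma-separated line is one synonym set."""
    return ", ".join(terms)


def one_directional(variants, target):
    """Explicit mapping: every variant on the left is rewritten to the target."""
    return f"{', '.join(variants)} => {target}"


lines = [
    bidirectional(["bag", "case"]),
    bidirectional(["playstation", "ps"]),
    one_directional(
        ["accesory", "accesoire", "accessoire", "accessorei"], "accessory"
    ),
    one_directional(["whirl pool"], "whirlpool"),
]
print("\n".join(lines))
```

Misspellings are a natural fit for the one-directional form, since the index should never expand a correct word into its misspelled variants.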

Agenda
• Introduction
• Existing methods and challenges
• Walkthrough of our approach
• Evaluation and comparison
• Misspelling and phrase extraction
• Demo of synonym detection job in Fusion
• Future work

Existing Methods and Challenges
• Knowledge-base methods, such as using WordNet, do not cover a
customer's own ontology well.
• Example results from WordNet on an ecommerce dataset:
• Lack of usefulness:
• mankind, humanity; luck, chance; interference, noise
• Missing context-specific synonyms:
• galaxy, Samsung galaxy; noise, quiet; vac, vacuum
• Knowledge bases are not updated frequently.

Existing Methods and Challenges
• Find synonyms with word2vec
• Example results from word2vec on an ecommerce dataset:
• Returns related words instead of interchangeable words:
• king, queen; red, blue; broom, floor
• Returns surrounding words:
• battery, rechargeable; unlocked, phone; power, supply
• Sensitive to hyper-parameters; prone to local optima.

Proposed method : Step 1 – Find similar queries
• Use customer behavior data to focus on queries that lead to similar sets of clicked
documents, then extract token-level and phrase-level synonyms from those queries.
Query               Doc Set   Num of Clicks
apple mac charger   1         500
Mac power           1         200
Mac power           2         100
Mac power           3         50
Use the Jaccard index to measure query similarity:
$$J(\mathit{query}_1, \mathit{query}_2) = \frac{|\mathit{DocSet}_1 \cap \mathit{DocSet}_2|}{|\mathit{DocSet}_1 \cup \mathit{DocSet}_2|}$$
Each doc set is weighted by number of clicks to de-noise.
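
A minimal sketch of one way to apply click weighting, using the table above. The slide does not give the exact weighting formula, so this assumes a common generalization: convert clicks to per-query shares, then take the sum of per-document minimums over the sum of per-document maximums. The function names are hypothetical.

```python
def click_shares(clicks):
    """Convert raw click counts per document into shares that sum to 1."""
    total = sum(clicks.values())
    return {doc: n / total for doc, n in clicks.items()}


def weighted_jaccard(weights_a, weights_b):
    """Weighted Jaccard: sum of minimums over sum of maximums (an assumption)."""
    docs = set(weights_a) | set(weights_b)
    inter = sum(min(weights_a.get(d, 0.0), weights_b.get(d, 0.0)) for d in docs)
    union = sum(max(weights_a.get(d, 0.0), weights_b.get(d, 0.0)) for d in docs)
    return inter / union if union else 0.0


# Click data from the table: doc id -> number of clicks.
apple_mac_charger = {1: 500}
mac_power = {1: 200, 2: 100, 3: 50}

# Unweighted Jaccard only sees one shared document out of three.
plain = len(set(apple_mac_charger) & set(mac_power)) / len(
    set(apple_mac_charger) | set(mac_power)
)
weighted = weighted_jaccard(click_shares(apple_mac_charger), click_shares(mac_power))
print(plain, weighted)
```

With the weighting, rarely clicked documents contribute little to the union, so two queries whose clicks concentrate on the same document score higher than the unweighted index suggests.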

Proposed method : Step 2 – Query pre-processing
• Stemming, stop words removal
• Find misspellings separately and correct misspellings in queries:
• If misspellings are left in, the output is: mattress, matress, mattrass, mattresss
which should be: matress, mattrass, mattresss => mattress
• Identify phrases in queries to find multi-word synonyms: mac, mac_book
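
The pre-processing steps could be sketched as follows. The stop word list, correction map, and phrase list are hypothetical placeholders; in the approach described here they would come from the separate misspelling and phrase extraction jobs. Stemming is left out to keep the example dependency-free.

```python
STOP_WORDS = {"for", "the", "a", "of"}  # hypothetical stop word list
CORRECTIONS = {  # hypothetical output of the misspelling job
    "matress": "mattress",
    "mattrass": "mattress",
    "mattresss": "mattress",
}
PHRASES = {("mac", "book"): "mac_book"}  # hypothetical output of phrase extraction


def preprocess(query):
    """Lowercase, drop stop words, fix misspellings, merge known two-word phrases."""
    tokens = [t for t in query.lower().split() if t not in STOP_WORDS]
    tokens = [CORRECTIONS.get(t, t) for t in tokens]
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            merged.append(PHRASES[pair])  # treat the phrase as a single token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


print(preprocess("charger for Mac book"))
print(preprocess("matress cover"))
```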

Proposed method : Step 3 – Extract synonyms
• Extract synonyms (tokens/phrases) from queries by finding tokens/phrases that
appear before/after the same word:
• E.g. Similar queries: laptop charger, laptop power
Synonym: charger, power
Similar queries: playstation console, ps console
Synonym: playstation, ps
• Measure synonym similarity by occurrence in similar queries, adjusted by the counts
of the synonym in the corpus.
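
One possible reading of the extraction step, as a sketch: strip the shared leading and trailing words from a pair of similar queries and treat what remains on each side as a candidate synonym pair. The function name is hypothetical, and the similarity scoring is omitted because the slide does not give its formula.

```python
def extract_candidate(query_a, query_b):
    """Return the differing spans of two queries that share context, else None."""
    a, b = query_a.split(), query_b.split()
    shortest = min(len(a), len(b))

    # Length of the shared prefix.
    p = 0
    while p < shortest and a[p] == b[p]:
        p += 1

    # Length of the shared suffix, not overlapping the prefix.
    s = 0
    while s < shortest - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
        s += 1

    if p + s == 0:
        return None  # no shared word before or after: no evidence

    left, right = a[p:len(a) - s], b[p:len(b) - s]
    if not left or not right:
        return None  # one query is just the other plus extra words
    return " ".join(left), " ".join(right)


print(extract_candidate("laptop charger", "laptop power"))
print(extract_candidate("playstation console", "ps console"))
```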

Proposed method : Step 4 – De-noise
• Drop synonym pairs that exist in the same query.
• Use a graph model to find relationships among synonyms, to put multiple synonyms
into the same set and to drop non-synonyms.
Synonym group: mac, apple mac, mac book
[Figure: graph of candidate synonym pairs, with nodes LCD tv, tv, LED tv and mac book, mac, apple mac]
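
The slides do not spell out the graph model, so the following is only a sketch under a stated assumption: candidate pairs become edges, and a connected component is kept as a synonym group only when every pair of its members is directly linked. Under that assumption mac, apple mac, and mac book form a group, while LCD tv and LED tv, which are each linked to tv but not to each other, do not. The function name and the sample queries are hypothetical.

```python
from itertools import combinations


def synonym_groups(pairs, queries):
    """Group candidate pairs into synonym sets; queries are sets of tokens/phrases."""
    # Drop pairs whose two terms appear together in the same query.
    kept = [
        (a, b) for a, b in pairs
        if not any(a in q and b in q for q in queries)
    ]

    # Build an undirected graph from the remaining pairs.
    graph = {}
    for a, b in kept:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    groups, seen = [], set()
    for start in graph:
        if start in seen:
            continue
        # Collect the connected component containing `start`.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        # Assumption: keep only fully connected components as synonym groups.
        if all(v in graph[u] for u, v in combinations(component, 2)):
            groups.append(component)
    return groups


pairs = [
    ("mac", "apple mac"), ("mac", "mac book"), ("apple mac", "mac book"),
    ("lcd tv", "tv"), ("led tv", "tv"),
    ("charger", "cable"),
]
queries = [{"mac", "charger", "cable"}, {"lcd tv", "stand"}]
print(synonym_groups(pairs, queries))
```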

Proposed method : Step 5 – Categorize output
• A tree-based model is built on features generated from the above steps
to help classify each pair as synonym vs. context:
• Example features: synonym similarity, number of contexts in which the synonym
appears, token overlap, synonym counts, etc.

Evaluation and comparison with word2vec
• Run word2vec on the catalog and trim rare words that do not appear in queries
(with the same misspelling and phrase extraction steps).

Evaluation and comparison with word2vec
• Manually evaluated synonym pairs generated from the ecommerce dataset.
Method                        Precision   Recall   F1
LW synonym job                83%         81%      82%
word2vec                      31%         28%      29%
word2vec with de-noise step   45%         25%      32%
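
The F1 column can be checked from the reported precision and recall, since F1 is their harmonic mean. The snippet below only restates the numbers from the table.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


results = {
    "LW synonym job": (0.83, 0.81),
    "word2vec": (0.31, 0.28),
    "word2vec with de-noise step": (0.45, 0.25),
}
for method, (p, r) in results.items():
    print(f"{method}: F1 = {f1(p, r):.0%}")
```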

Spell Correction in Fusion 4.0:
• An offline job that finds misspellings and provides corrections based on the number of
occurrences of words/phrases. Compared to the Solr spell checker, the advantages of this job
are:
• If query clicks are captured after the Solr spell checker was turned on, then the misspellings
found from click data mainly identify erroneous corrections or missing corrections from Solr.
• It allows offline human review to make sure the changes are all correct. If the user has a dictionary
(e.g. a product catalog) to check against the list, the job will go through the result list to make
sure that misspellings do not exist in the dictionary and corrections do exist in the dictionary.

• High accuracy rate (96%). In addition to the basic Solr spell checker settings:
• When there are multiple possible corrections, we rank corrections based on multiple criteria in
addition to edit distance.
• Rather than using a fixed max edit distance filter, we use an edit distance threshold relative to
the query length to provide more wiggle room for long queries.
• Since the job runs offline, it eases concerns about expensive spell check tasks in Solr
spell check. E.g., it does not limit the maximum number of possible matches to review
(the maxInspections parameter in Solr).
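
To illustrate the relative edit distance threshold, here is a minimal sketch using plain Levenshtein distance. The `ratio` value of 0.25 and the floor of one edit are assumptions for the example, not the job's actual settings.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def within_threshold(misspelling, correction, ratio=0.25):
    """Allow more edits for longer queries (ratio is a hypothetical setting)."""
    allowed = max(1, int(len(misspelling) * ratio))
    return edit_distance(misspelling, correction) <= allowed


print(within_threshold("matress", "mattress"))
print(within_threshold("refridgerator water fliter", "refrigerator water filter"))
print(within_threshold("cat", "dog"))
```

A fixed maximum of two edits would reject the long refrigerator query even though each word is only slightly off, which is the case the relative threshold is meant to handle.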

• Several fields are provided to facilitate the reviewing process:
• By default, results are sorted by "mis_string_len" (descending) and "edit_dist" (ascending) to position more
probable corrections at the top.
• Soundex or last character match indicator.

• Several additional fields are provided to show the relationships among the
token corrections and phrase corrections to help further reduce the list:
• The suggested_corrections field helps automatically choose between phrase-level and token-level
correction. If confidence in the correction is low, a “review” label is attached.

• The resulting corrections can be used in various ways, for example:
• Put into synonym list in Solr to perform auto correction.
• Help evaluate and guide Solr spellcheck configuration.
• Put into typeahead or autosuggest list.
• Perform document cleansing (e.g. clean product catalog or medical records) by
mapping misspellings to corrections.

Phrase Extraction in Fusion:
• Income tax -> tax Income tax -> income
• A Spark job detects commonly co-occurring terms as phrases.
• Usage:
A. In the query pipeline, boost any phrase that appears,
e.g. for the query red ipad case, rewrite it to red "ipad case"~10^2
B. Treat phrases as a single token (ipad_case) and feed them into downstream
jobs such as clustering/classification/synonym detection.
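
Usage A can be sketched as a small rewrite function that wraps any known phrase in quotes and appends a proximity slop and a boost. The function name and the phrase set are hypothetical; in practice the phrases would come from the phrase extraction job, and the rewrite would happen in a query pipeline stage.

```python
def boost_phrases(query, phrases, slop=10, boost=2):
    """Quote known phrases and append ~slop^boost, leaving other tokens as-is."""
    tokens = query.split()
    out, i = [], 0
    while i < len(tokens):
        # Try the longest span starting at i first, down to two tokens.
        for n in range(len(tokens) - i, 1, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in phrases:
                out.append(f'"{candidate}"~{slop}^{boost}')
                i += n
                break
        else:
            out.append(tokens[i])  # no phrase starts here
            i += 1
    return " ".join(out)


print(boost_phrases("red ipad case", {"ipad case"}))
```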

Synonym review process in Fusion 4.2

Automatic tail query rewriting

Tail rewriting at query time
User searched for: “red case for macbook.pro”
After query rewriting: “macbook pro case”~10^2 color: red

Future work
• Utilize query rewrites in session logs.
• Explore deep learning embeddings and attention weights.
(source: Rush et al. (2014): https://arxiv.org/pdf/1409.0473.pdf)
• Evaluate results on more types of data.

Thank you!
Chao Han
VP, Head of Data Science, Lucidworks
