Automatically Build Solr Synonym List
Using Machine Learning
Chao Han
VP, Head of Data Science, Lucidworks
Goal
• Automatically generate a Solr synonym list that includes synonyms, common
misspellings and misplaced blank spaces. Choose the right Solr synonym format
(e.g., one-directional or bi-directional).
• Examples:
• Synonym: bag, case; four, iv; mac, apple mac, mac book, macbook
• Acronym: playstation, ps
• Misspelling: accesory, accesoire, accessoire, accessorei => accessory
• Misplaced blank spaces: book end, bookend; whirl pool => whirlpool
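A minimal sketch of how the two output formats might be written, assuming the comma-separated (bi-directional) and `=>` (one-directional) rule syntax of a Solr synonyms file. The helper names `equivalent_rule` and `explicit_rule` are hypothetical and are not part of the Fusion job.

```python
def equivalent_rule(terms):
    """Bi-directional rule: all terms are treated as interchangeable."""
    return ", ".join(terms)


def explicit_rule(sources, target):
    """One-directional rule: each source term is mapped to the target."""
    return f"{', '.join(sources)} => {target}"


# Examples taken from the Goal slide.
rules = [
    equivalent_rule(["mac", "apple mac", "mac book", "macbook"]),
    equivalent_rule(["playstation", "ps"]),
    explicit_rule(["accesory", "accesoire", "accessoire", "accessorei"], "accessory"),
    explicit_rule(["whirl pool"], "whirlpool"),
]
print("\n".join(rules))
```

Misspellings are written one-directionally so that the misspelled forms are never produced as expansions of the correct word.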
Agenda
• Introduction
• Existing methods and challenges
• Walk through of our approach
• Evaluation and comparison
• Misspelling and phrase extraction
• Demo of synonym detection job in Fusion
• Future works
Existing Methods and Challenges
• Knowledge-base methods, such as WordNet, do not cover a customer’s own
ontology well.
• Example results from WordNet on an ecommerce dataset:
• Pairs of little use for product search:
• mankind, humanity; luck, chance; interference, noise
• Missing context-specific synonyms:
• galaxy, Samsung galaxy; noise, quiet; vac, vacuum;
• Knowledge bases are not updated frequently.
Existing Methods and Challenges
• Find synonyms with word2vec
• Example results from word2vec on an ecommerce dataset:
• Returns related words rather than interchangeable words:
• king, queen; red, blue; broom, floor;
• Returns surrounding words:
• battery, rechargeable; unlocked, phone; power, supply;
• Sensitive to hyper-parameters; can converge to a local optimum.
Proposed method : Step 1 – Find similar queries
• Use customer behavior data to focus on queries that lead to similar sets of clicked
documents, then extract token-level and phrase-level synonyms from them.
Query | Doc Set | Num of Clicks
apple mac charger | 1 | 500
apple mac charger | 2 | 300
apple mac charger | 3 | 100
apple mac charger | 4 | 30
Mac power | 1 | 200
Mac power | 2 | 100
Mac power | 3 | 50
Use the Jaccard index to measure query similarity:
$$J(\text{query}_1, \text{query}_2) = \frac{|\text{DocSet}_1 \cap \text{DocSet}_2|}{|\text{DocSet}_1 \cup \text{DocSet}_2|}$$
Each doc set is weighted by number of clicks to de-noise.
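The slide does not say how the click weighting enters the Jaccard index. The sketch below shows the plain set version next to one common weighted generalization (sum of per-document minimum clicks over sum of per-document maximum clicks). That weighting scheme and the function names are assumptions for illustration, not the formula used in the Fusion job.

```python
def jaccard(docs_a, docs_b):
    """Plain Jaccard index on the sets of clicked documents."""
    a, b = set(docs_a), set(docs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def weighted_jaccard(clicks_a, clicks_b):
    """Click-weighted variant (assumed): sum of minima over sum of maxima."""
    docs = set(clicks_a) | set(clicks_b)
    num = sum(min(clicks_a.get(d, 0), clicks_b.get(d, 0)) for d in docs)
    den = sum(max(clicks_a.get(d, 0), clicks_b.get(d, 0)) for d in docs)
    return num / den if den else 0.0


# Click counts per document, taken from the table on this slide.
apple_mac_charger = {1: 500, 2: 300, 3: 100, 4: 30}
mac_power = {1: 200, 2: 100, 3: 50}

plain = jaccard(apple_mac_charger, mac_power)              # 3 shared docs out of 4
weighted = weighted_jaccard(apple_mac_charger, mac_power)  # 350 / 930
print(plain, round(weighted, 3))
```

With weighting, a document clicked only a handful of times contributes little to either the intersection or the union, which is one way the click counts can de-noise the similarity.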
Proposed method : Step 2 – Query pre-processing
• Stemming, stop words removal
• Find misspellings separately and correct them in the queries:
• If misspellings are left in, the output is: mattress, matress, mattrass, mattresss
when it should be: matress, mattrass, mattresss => mattress
• Identify phrases in queries to find multi-word synonyms: mac, mac_book
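A minimal sketch of the pre-processing described above, assuming the misspelling corrections and the phrase list have already been produced by separate jobs. The stop-word list, the correction map and the phrase set are illustrative placeholders, and stemming is left out of the sketch.

```python
# Illustrative inputs; in practice these would come from the spell-correction
# and phrase-extraction jobs.
STOP_WORDS = {"for", "the", "a", "of"}
CORRECTIONS = {"matress": "mattress", "mattrass": "mattress", "mattresss": "mattress"}
PHRASES = {("mac", "book")}  # two-token phrases only, to keep the sketch short


def preprocess(query):
    """Lowercase, drop stop words, fix known misspellings, join known phrases."""
    tokens = [t for t in query.lower().split() if t not in STOP_WORDS]
    tokens = [CORRECTIONS.get(t, t) for t in tokens]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in PHRASES:
            merged.append(tokens[i] + "_" + tokens[i + 1])  # e.g. mac_book
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


print(preprocess("Mac Book charger"))
print(preprocess("matress for the bed"))
```

Joining a phrase into a single token such as mac_book lets the later steps treat it like any other token when looking for synonyms.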
Proposed method : Step 3 – Extract synonyms
• Extract synonyms (tokens/phrases) from similar queries by finding tokens/phrases that
appear before/after the same word:
• E.g. Similar queries: laptop charger, laptop power
Synonym: charger, power
Similar queries: playstation console, ps console
Synonym: playstation, ps
• Measure synonym similarity by occurrences in similar queries, adjusted by the counts
of the synonyms in the corpus.
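One way to read the extraction rule above is to strip the tokens two similar queries share at the start and at the end and keep what is left as a candidate pair. The sketch below does that and counts how often each pair is seen. The function name `candidate_pair` is hypothetical, and the adjustment by corpus counts is not shown because the slides do not give its formula.

```python
from collections import Counter


def candidate_pair(query_a, query_b):
    """Return the differing parts of two queries that share a leading or
    trailing context, or None when there is no shared context."""
    a, b = query_a.split(), query_b.split()
    # Strip the common prefix.
    start = 0
    while start < min(len(a), len(b)) and a[start] == b[start]:
        start += 1
    a, b = a[start:], b[start:]
    # Strip the common suffix.
    end = 0
    while end < min(len(a), len(b)) and a[-1 - end] == b[-1 - end]:
        end += 1
    if end:
        a, b = a[:-end], b[:-end]
    if start + end == 0 or not a or not b:
        return None
    return " ".join(a), " ".join(b)


similar_queries = [
    ("laptop charger", "laptop power"),
    ("playstation console", "ps console"),
    ("phone charger", "phone power"),
]
pair_counts = Counter()
for q1, q2 in similar_queries:
    pair = candidate_pair(q1, q2)
    if pair:
        pair_counts[tuple(sorted(pair))] += 1
print(pair_counts)
```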
Proposed method : Step 4 – De-noise
• Drop synonym pairs whose terms appear in the same query.
• Use a graph model to find relationships among synonyms, to put multiple synonyms
into the same set and to drop non-synonyms.
Synonym group: mac, apple mac, mac book
[Figure: synonym graph with nodes LCD tv, tv, LED tv, mac book, mac, apple mac]
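The speaker notes mention the Bron-Kerbosch clique algorithm for this step. The sketch below is a plain, unoptimized version of it in pure Python. The edge list and the `min_size` cut-off are assumptions made up for illustration and are not settings of the Fusion job.

```python
def bron_kerbosch(r, p, x, adj, out):
    """Collect all maximal cliques of the graph `adj` into `out`."""
    if not p and not x:
        out.append(frozenset(r))
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p = p - {v}
        x = x | {v}


def synonym_groups(pairs, min_size=3):
    """Build an undirected graph from candidate pairs and keep the maximal
    cliques that have at least `min_size` members (assumed threshold)."""
    adj = {}
    for a, b in pairs:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    cliques = []
    bron_kerbosch(set(), set(adj), set(), adj, cliques)
    return [c for c in cliques if len(c) >= min_size]


# Hypothetical candidate pairs: the mac terms are all connected to each other,
# while LCD tv and LED tv are each linked to tv but not to one another.
pairs = [
    ("mac", "apple mac"), ("mac", "mac book"), ("apple mac", "mac book"),
    ("tv", "lcd tv"), ("tv", "led tv"),
]
groups = synonym_groups(pairs)
print(groups)
```

Requiring a clique rather than a connected component is what keeps lcd tv and led tv from being merged into one set just because both are linked to tv.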
Proposed method : Step 5 – Categorize output
• A tree-based model is built on features generated in the steps above
to help choose between synonym and context:
• Example features: synonym similarity, number of contexts in which the synonym
appears, token overlap, synonym counts, etc.
Evaluation and comparison with word2vec
• Run word2vec on the catalog and trim the rare words that are not in queries (with
the same misspelling and phrase extraction steps).
Evaluation and comparison with word2vec
• Manually evaluated synonym pairs generated from the ecommerce dataset.
Method | Precision | Recall | F1
LW synonym job | 83% | 81% | 82%
word2vec | 31% | 28% | 29%
word2vec with de-noise step | 45% | 25% | 32%
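As a quick check of the table, F1 is the harmonic mean of precision and recall, and the short sketch below recomputes it from the reported precision and recall values. Only the numbers come from the slide; the code itself is illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as reported in the evaluation table.
results = {
    "LW synonym job": (0.83, 0.81),
    "word2vec": (0.31, 0.28),
    "word2vec with de-noise step": (0.45, 0.25),
}
for method, (p, r) in results.items():
    print(f"{method}: F1 = {f1(p, r):.0%}")
```

The de-noise step raises word2vec precision but lowers recall, so its F1 moves only from 29% to 32%.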
Spell Correction in Fusion 4.0:
• An offline job that finds misspellings and provides corrections based on the number of
occurrences of words/phrases. Compared to the Solr spell checker, the advantages of this job
are:
• If query clicks are captured after the Solr spell checker was turned on, the misspellings
found from click data mainly identify erroneous corrections or missing corrections from Solr.
• It allows offline human review to make sure the changes are all correct. If the user has a dictionary
(e.g. a product catalog) to check the list against, the job goes through the result list to make
sure misspellings do not exist in the dictionary and corrections do exist in the dictionary.
Spell Correction in Fusion 4.0:
• High accuracy rate (96%). In addition to the basic Solr spell checker settings:
• When there are multiple possible corrections, we rank them on multiple criteria in
addition to edit distance.
• Rather than using a fixed max edit distance filter, we use an edit distance threshold relative to
the query length, which gives long queries more room.
• Since the job runs offline, it eases concerns about expensive spell check tasks in Solr
spell check. E.g., it does not limit the maximum number of possible matches to review
(maxInspections parameter in Solr).
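The speaker notes describe the length-relative filter as keeping only pairs with edit_distance <= query_length/length_scale, with length_scale=4 as the example. The sketch below assumes integer division and uses the length of the misspelled string as the query length. The function names are hypothetical, and the edit distance is a standard Levenshtein implementation rather than the job's own code.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def keep_pair(misspelling, correction, length_scale=4):
    """Keep the pair only if the edit distance is within the length-relative
    threshold (length 4 to 7 allows 1 edit, length 8 to 11 allows 2)."""
    return edit_distance(misspelling, correction) <= len(misspelling) // length_scale


print(keep_pair("matress", "mattress"))      # length 7, 1 edit allowed
print(keep_pair("accessorei", "accessory"))  # length 10, 2 edits allowed
print(keep_pair("matres", "mattress"))       # length 6, needs 2 edits
```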
Spell Correction in Fusion 4.0:
• Several fields are provided to facilitate the review process:
• By default, results are sorted by "mis_string_len" (descending) and "edit_dist" (ascending) to position more
probable corrections at the top.
• Soundex or last character match indicator.
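The default ordering above can be reproduced with a two-part sort key. In this sketch only the field names mis_string_len and edit_dist come from the slide. The other keys and the row values are made up for illustration, and mis_string_len is assumed to be the length of the misspelled string.

```python
# Hypothetical result rows from the spell-correction job.
rows = [
    {"misspelling": "matress", "correction": "mattress", "mis_string_len": 7, "edit_dist": 1},
    {"misspelling": "accesoire", "correction": "accessory", "mis_string_len": 9, "edit_dist": 3},
    {"misspelling": "accessorei", "correction": "accessory", "mis_string_len": 10, "edit_dist": 2},
    {"misspelling": "mattresss", "correction": "mattress", "mis_string_len": 9, "edit_dist": 1},
]

# Longest misspellings first; among equal lengths, smallest edit distance first.
ranked = sorted(rows, key=lambda r: (-r["mis_string_len"], r["edit_dist"]))
for r in ranked:
    print(r["misspelling"], "=>", r["correction"])
```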
Spell Correction in Fusion 4.0:
• Several additional fields are provided to show the relationship between the
token corrections and phrase corrections, to help further reduce the list:
• The suggested_corrections field helps automatically choose between phrase-level and token-level
correction. If confidence in the correction is low, a “review” label is attached.
Spell Correction in Fusion 4.0:
• The resulting corrections can be used in various ways, for example:
• Put into synonym list in Solr to perform auto correction.
• Help evaluate and guide Solr spellcheck configuration.
• Put into typeahead or autosuggest list.
• Perform document cleansing (e.g. clean product catalog or medical records) by
mapping misspellings to corrections.
Phrase Extraction in Fusion:
• Income tax -> tax Income tax -> income
• A Spark job detects commonly co-occurring terms as phrases
• Usage:
A. In the query pipeline, boost on any phrase that appears,
e.g. for the query red ipad case, rewrite it to red “ipad case”~10^2
B. Treat phrases as a single token (ipad_case) and feed into downstream
jobs such as clustering/classification/synonym detection.
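Usage A can be sketched as a small query rewrite, assuming the phrase list has already been produced by the phrase-extraction job. The function name `boost_phrases` and its defaults are hypothetical. The output follows Lucene query syntax, where "..."~10 is a proximity search with slop 10 and ^2 is a boost of 2.

```python
def boost_phrases(query, phrases, slop=10, boost=2):
    """Wrap each known two-token phrase in a proximity query with a boost."""
    tokens = query.split()
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and f"{tokens[i]} {tokens[i + 1]}" in phrases:
            out.append(f'"{tokens[i]} {tokens[i + 1]}"~{slop}^{boost}')
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


known_phrases = {"ipad case"}  # illustrative output of the phrase job
rewritten = boost_phrases("red ipad case", known_phrases)
print(rewritten)
```

For usage B, the same phrase list would be used to join the tokens with an underscore (ipad_case) before the text is passed to downstream jobs.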
Synonym review process in Fusion 4.2
Automatic tail query rewriting
Tail reason investigation
Tail rewriting at query time
User searched for “red case for macbook.pro”
See this: After query rewriting: “macbook pro case”~10^2 color: red
Future works
• Utilize query rewrites in session logs.
• Explore deep learning embeddings and attention weights.
(source: Rush et al (2014): https://arxiv.org/pdf/1409.0473.pdf)
• Evaluate results on more types of data.
Thank you!
Chao Han
VP, Head of Data Science, Lucidworks


Editor's Notes

  • #3: A synonym list plays an important part in search. However, it usually takes the search or ontology group in a company a long time to detect and maintain synonyms. This talk is set in the context of an ecommerce search use case.
  • #5: There have already been experiments on automatically generating synonyms. I will talk about two of the most popular methods here.
  • #6: Word2vec is a shallow neural network that tries to predict target words from nearby words, or vice versa. We then take the dense vectors out, basically moving from word space to vector space, and find nearest neighbors through cosine similarity. Because the vectors live in a vast high-dimensional space, two vectors can be similar in any sense. E.g. red and blue are similar because they are both colors; broom and floor share a functional relationship. They are related but they are not interchangeable. In a search application we usually require synonyms to be bi-directional and interchangeable, so related but non-interchangeable pairs can lead to relevancy problems. E.g. if I want a king bed sheet, I may not want a queen bed sheet. Red paint is not blue paint. Because the word2vec model is built to predict context from target words, it tends to find surrounding words. Since word2vec is a neural network model trained with SGD, it can converge to a local optimum. Overall, the failed examples from the word2vec results here are due to a lack of constraints. The problem with WordNet is a mismatch in semantic context between customer data and the general dictionary.
  • #8: To tackle the above problems, we propose a 5-step synonym detection algorithm. Nowadays websites can easily track and store user events such as queries, result clicks and purchases. We can use this collective behavior to create clickstream or LTR models, and we can also use this data to help find synonyms. The first step is to find similar queries, from which we can then extract synonyms. This way we put constraints on the method through the input data.
  • #9: Since we don’t want to put all the stemmed and non-stemmed pairs into the synonym list, we just leave the stemming work to Solr.
  • #10: This looks like a naive method without fancy modeling involved, but it turns out to work pretty well. I think that is because it is a straightforward way to replicate how people construct language. Also, here we are not projecting the words into a different vector space as in word2vec, so we are getting the first-order similarity between words.
  • #11: I have to say all methods lead to noise due to the nature of click data. Synonyms should be transitive. Use a graph algorithm to find a community that has enough edges in the graph. (With the Bron-Kerbosch clique algorithm, an example clique is: frozenset({‘ear’, ‘ear bud’, ‘earbud’, ‘earphone’, ‘headset’}), whereas requiring only a connected component would be messy: audio, headphone, ear bud, ipod, headset, earbud, head, beat, heartbeat, ie, ibeat, tour, ear headphone, earphone, ear. To keep good recall, I’m also considering loose cliques, i.e., if two triangles have 2 edges between each other, they can be treated as 1 clique, which is looser than the strict clique definition.)
  • #12: A problem we face is that some of the synonyms we extract are too abstract and do not work outside a certain context. In this algorithm’s output, we find the most frequently occurring words before/after the synonym pair. We call this the context pair. In this case, the tree model predicts that we should include the word console in the synonym pair to make it clearer.
  • #17: Many query misspellings may be due to the same tokens or phrases. So in Fusion 4 we have a new job, the token and phrase spell checker, which can help you find misspellings and suggest corrections. Solr spell checker: index-based, executes at query time.
  • #18: Such as min prefix match, max edit distance, min length of misspelling, count thresholds of misspellings and corrections, and collation check. Specifically, we apply a filter such that only pairs with edit_distance <= query_length/length_scale are kept. E.g., if we choose length_scale=4, for queries with lengths between 4 and 7 the edit distance has to be 1 for the pair to be chosen, while for queries with lengths between 8 and 11 the edit distance can be 2. The job is also able to find comprehensive lists of spelling errors resulting from misplaced whitespace (breakWords in Solr).
  • #19: Can also sort by the ratio of correction traffic to misspelling traffic to keep only corrections that boost high-traffic queries.