Relevancy Hacks for eCommerce
VARUN THACKER
@VARUNTHACKER
AGENDA
• How to solve multiple eCommerce use cases by using the features present in Solr
• Query parsing
• Building on the TF-IDF scoring model and improving it for your data set
• Adding relevancy signals to your score to rank documents better
• Customising search results on a per-query basis
HOW DO QUERIES SCORE DOCUMENTS?
• Example document:

{
  "title" : "LG Nexus 5",
  "brand" : "LG",
  "category" : "Smartphones",
  "tags" : "phones, android, touch"
}

• query = LG Nexus
HOW DO QUERIES SCORE DOCUMENTS?
• Scores are field relative.
• I want a query that matches each token against all the fields.
• Approach 1: Use a BooleanQuery
  • Query query1 = new TermQuery(new Term("title", "lg"));
  • Query query2 = new TermQuery(new Term("title", "nexus"));
  • Query query3 = new TermQuery(new Term("brand", "lg"));
  • Query query4 = new TermQuery(new Term("brand", "nexus"));
  • Add all the queries into a BooleanQuery
  • Score = query1 + query2 + query3 + query4
  • This counts the match for "lg" twice, once in title and once in brand.
• Approach 2: Use a DisjunctionMaxQuery. It scores each document with the maximum score produced by any subquery (plus an optional tie-breaker share of the other matching subqueries) instead of the sum.
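The difference between the two approaches can be sketched in plain Java, without Lucene on the classpath. This is only an illustration of the arithmetic: the class name, the method names and the per-field scores are made up for the example, and the tie-breaker behaviour follows the documented DisjunctionMaxQuery formula (best clause plus tieBreaker times the remaining clauses).

```java
import java.util.Arrays;

final class FieldScoreCombiner {
    // BooleanQuery-style combination: add up every matching clause.
    static double sum(double... clauseScores) {
        return Arrays.stream(clauseScores).sum();
    }

    // DisjunctionMaxQuery-style combination: best clause plus
    // tieBreaker * (sum of the other clauses).
    static double disMax(double tieBreaker, double... clauseScores) {
        double max = Arrays.stream(clauseScores).max().orElse(0.0);
        double total = Arrays.stream(clauseScores).sum();
        return max + tieBreaker * (total - max);
    }

    public static void main(String[] args) {
        // Hypothetical scores for the token "lg" in the title and brand fields.
        double titleLg = 0.5;
        double brandLg = 1.0;
        System.out.println("sum    = " + sum(titleLg, brandLg));
        System.out.println("disMax = " + disMax(0.0, titleLg, brandLg));
    }
}
```

With a tie-breaker of 0 only the best field counts, so a token that appears in both title and brand is not rewarded twice.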
DEFAULT SIMILARITY FACTORS
• TF - number of occurrences of the term in the document's field.
• IDF - a measure of how rare the term is across the index.
• Normalisation - applied both at index time and at query time.
• Coordination factor - the fraction of the query terms that a document matches.
• These statistics are per field.
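As a rough guide to how these factors behave, here is a minimal sketch assuming the formulas of Lucene's classic TF-IDF similarity (DefaultSimilarity in the 4.x line). The class and method names are made up for the example, and the sketch ignores index-time boosts and the lossy encoding Lucene applies to stored norms.

```java
final class ClassicFactors {
    // Term frequency factor: grows with the square root of the raw count.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: rarer terms get a larger weight.
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Index-time length normalisation: shorter fields score higher.
    static double lengthNorm(int numTokens) {
        return 1.0 / Math.sqrt(numTokens);
    }

    // Coordination factor: share of the query terms found in the document.
    static double coord(int matchedTerms, int queryTerms) {
        return (double) matchedTerms / queryTerms;
    }

    public static void main(String[] args) {
        System.out.println("tf(4)          = " + tf(4));
        System.out.println("idf(9, 10)     = " + idf(9, 10));
        System.out.println("lengthNorm(4)  = " + lengthNorm(4));
        System.out.println("coord(2, 3)    = " + coord(2, 3));
    }
}
```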
WHY THE DEFAULT SCORING MAY NOT WORK?
• TF-IDF is calculated per field.
• Let's take term frequency first:
  • Product 1: iPad Air
  • Product 2: iPad Air case. Works well with iPad 3 and iPad 2
  • query = iPad
  • On term frequency alone, Product 2 would rank before Product 1
  • But this is clearly not what the user is looking for
  • Does iPad occurring multiple times make the product more relevant?
  • Idea - let's make TF = 1 for any token match
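The idea can be sketched as a comparison between a square-root term frequency (the classic Lucene behaviour, assumed here) and a flat one. In Lucene itself this kind of change would typically live in a custom Similarity; the standalone class below is only an illustration with made-up names, using the raw counts of "iPad" from the two products above.

```java
final class BinaryTf {
    // Classic behaviour (assumed): more occurrences, higher factor.
    static double classicTf(int freq) {
        return Math.sqrt(freq);
    }

    // Flat behaviour: any match counts once, repeats add nothing.
    static double binaryTf(int freq) {
        return freq > 0 ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        int ipadAir = 1;      // "iPad" occurs once in Product 1
        int ipadAirCase = 3;  // "iPad" occurs three times in Product 2
        System.out.println("classic: " + classicTf(ipadAir) + " vs " + classicTf(ipadAirCase));
        System.out.println("binary:  " + binaryTf(ipadAir) + " vs " + binaryTf(ipadAirCase));
    }
}
```

With the flat version the accessory no longer gains anything from repeating the product name, so other signals decide the order.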
WHY THE DEFAULT SCORING MAY NOT WORK?
• Tackling inverse document frequency
  • Product 1 - brown jacket
  • Product 2 - leather jacket
  • q = brown leather jacket
  • IDF is not a measure of usefulness but a measure of rarity.
  • Should IDF from your corpus be the true judge of whether "leather" is more important than "brown"?
  • Maybe you stock fewer brown jackets, but that doesn't make "brown" more important than "leather".
  • Combine data from many stores in your vertical and compute the IDF scores offline
  • Feed them back into your custom Similarity implementation
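One possible shape for the offline step is sketched below: per-store document frequencies are merged, then turned into an IDF table that a custom Similarity could look up. Everything here is an assumption for illustration: the class and method names, the sample counts, and the use of the classic 1 + ln(N / (df + 1)) formula.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OfflineIdf {
    // Adds up the per-store document frequencies for each term.
    static Map<String, Long> mergeDocFreqs(List<Map<String, Long>> perStore) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> store : perStore) {
            store.forEach((term, df) -> merged.merge(term, df, Long::sum));
        }
        return merged;
    }

    // Turns merged document frequencies into IDF weights (classic formula).
    static Map<String, Double> idfTable(Map<String, Long> docFreqs, long totalDocs) {
        Map<String, Double> idf = new HashMap<>();
        docFreqs.forEach((term, df) ->
                idf.put(term, 1.0 + Math.log((double) totalDocs / (df + 1))));
        return idf;
    }

    public static void main(String[] args) {
        // Hypothetical counts: store A stocks few brown items, store B stocks many.
        Map<String, Long> storeA = Map.of("brown", 5L, "leather", 40L);
        Map<String, Long> storeB = Map.of("brown", 60L, "leather", 35L);
        Map<String, Long> merged = mergeDocFreqs(List.of(storeA, storeB));
        System.out.println(idfTable(merged, 200L));
    }
}
```

In store A alone "brown" looks far rarer than "leather"; after merging, the two terms end up with similar weights.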
WHY THE DEFAULT SCORING MAY NOT WORK?
• The tie-breaker between two documents with the same number of term matches is "fieldNorm": the document whose field contains fewer tokens scores higher.
FUNCTION QUERIES
• FunctionQuery lets you use the actual value of a field, or a function of field values, in the relevancy score.
• It iterates over all documents serially, applying the function to each.
• The result can be multiplied into the score through the boost parameter of the eDisMax query parser.
INCLUDE POPULARITY DATA
• Popularity could be anything: best-selling items, most-viewed products, trending items, etc.
• Compute the "popularity" score offline for each document in the index.
• Store it in the document if your data set is small; otherwise you could use an ExternalFileField.
• Use a function query:
  • &boost=popularity, giving final score = popularity value * score
• With the new expressions module coming in Lucene 4.6, it's fairly simple to add multiple signals to your ranking formula
  • Expression expr = JavascriptCompiler.compile("_score + ln(popularity) + ln(margin)");
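The two ways of mixing signals shown above can be written out as plain arithmetic. This is a sketch only: the class and method names are invented, and it does not use the Lucene expressions module itself.

```java
final class RankingSignals {
    // Multiplicative boost, as with the eDisMax boost parameter.
    static double multiplied(double score, double popularity) {
        return score * popularity;
    }

    // Additive combination mirroring "_score + ln(popularity) + ln(margin)".
    // Both signals must be positive for the logarithm to be defined.
    static double combined(double score, double popularity, double margin) {
        return score + Math.log(popularity) + Math.log(margin);
    }

    public static void main(String[] args) {
        // Hypothetical values for a single document.
        double score = 2.0;
        double popularity = 150.0;
        double margin = 12.0;
        System.out.println("multiplied = " + multiplied(score, popularity));
        System.out.println("combined   = " + combined(score, popularity, margin));
    }
}
```

The logarithms dampen large raw values, so a product that is ten times more popular does not simply swamp the text score.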
ADDING CLICK THROUGH DATA
• Use this on a per-query basis or for a set of similar queries.
• We used function queries which take
  • ids and their associated boosts
• An external application enables the function query depending on the search query
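The slides do not show how the ids and boosts were stored, so the following is only a guess at the general shape: a lookup table from query text to per-document boosts, consulted by the external application. The class name, the table layout and the sample ids are all hypothetical.

```java
import java.util.Map;

final class ClickBoosts {
    // Hypothetical table: query text -> (document id -> boost from click data).
    private final Map<String, Map<String, Double>> boostsByQuery;

    ClickBoosts(Map<String, Map<String, Double>> boostsByQuery) {
        this.boostsByQuery = boostsByQuery;
    }

    // Multiplies the base score by the click boost, or leaves it unchanged
    // when the query or the document has no click data.
    double apply(String query, String docId, double baseScore) {
        double boost = boostsByQuery
                .getOrDefault(query, Map.of())
                .getOrDefault(docId, 1.0);
        return baseScore * boost;
    }

    public static void main(String[] args) {
        ClickBoosts boosts = new ClickBoosts(
                Map.of("android phones", Map.of("nexus-5", 2.0)));
        System.out.println(boosts.apply("android phones", "nexus-5", 1.5));
        System.out.println(boosts.apply("android phones", "other-doc", 1.5));
    }
}
```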
BOOSTING NEWER PRODUCTS
• Blindly sort the results
  • &sort=release_date desc
• Give preference to newer products
  • recip(ms(NOW/DAY,release_date),3.16e-11,2,1)
  • Where recip(x, m, a, b) = a / (m*x + b)
  • Picking a = 2, b = 1, m = 3.16e-11 (roughly 1 / the number of milliseconds in a year)
  • Gives a boost of 2 for today's product
  • Gives a boost of about 1.3 for a half-year-old product
  • Gives a boost of about 1 for a one-year-old product, and so on
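The boost values quoted above can be checked with a few lines of arithmetic. This sketch reimplements the reciprocal function a / (m*x + b) in plain Java with a = 2, b = 1 and m = 3.16e-11; the class name and the 365.25-day year are assumptions made for the example.

```java
final class RecencyBoost {
    // Milliseconds in an average year (365.25 days).
    static final double MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000;

    // Reciprocal function: a / (m * x + b).
    static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }

    // Boost for a product of the given age, using a = 2, b = 1, m = 3.16e-11.
    static double boostForAge(double ageMs) {
        return recip(ageMs, 3.16e-11, 2.0, 1.0);
    }

    public static void main(String[] args) {
        System.out.println("today:     " + boostForAge(0));
        System.out.println("half year: " + boostForAge(MS_PER_YEAR / 2));
        System.out.println("one year:  " + boostForAge(MS_PER_YEAR));
    }
}
```

The boost decays smoothly rather than cutting off, so an older product with a strong text match can still outrank a brand-new one.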
I'M STILL NOT SATISFIED!
• Take your top N queries and use the QueryElevationComponent :)
• Fix particular documents for certain queries
• No scoring is taken into consideration for these queries

<elevate>
  <query text="android phones">
    <doc id="nexus 4" />
    <doc id="iPhone" exclude="true"/>
  </query>
</elevate>
THANK YOU
• Questions?

