Retrieving
Information from Solr
JOSA Data Science Bootcamp
● Head of Technology @
OpenSooq.com
● Technical Reviewer for
“Scaling Apache Solr” and
“Apache Solr Search Patterns”
(Books)
● Contributor in Apache Solr
● Built 10 search engines in the
last 2 years
Ramzi Alqrainy
Topics to be covered
● Exploring Solr’s Query Form
● Basic Queries and Parameters
● Matching Multiple Terms
● Fuzzy Matching
● Range Searches
● Sorting
● Pseudo Fields
● Geospatial Searches
● Filter Queries
● Faceting and Stats
● Tuning Relevance
Detailed
Architectural
Diagram
Basic Queries and Parameters
Exploring Solr’s
Query Form
Basic Queries and Parameters
Matching Multiple Terms
Boolean Queries
● Search for two different terms, new and house, requiring both to
match
● Search for two different terms, new and house, requiring only one
to match
● Default operator is OR, can be changed using the q.op query
parameter.
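A minimal sketch of how the two boolean searches above might be expressed as request parameters. The field name title and the helper build_params are assumptions made for illustration; q and q.op are the Solr parameters named on the slide.

```python
from urllib.parse import urlencode

def build_params(query, default_operator=None):
    """Return Solr request parameters for a query string."""
    params = {"q": query}
    if default_operator is not None:
        # q.op overrides the default operator (OR) for this request
        params["q.op"] = default_operator
    return params

# Both terms must match
both = build_params("title:(new AND house)")
# Only one of the terms has to match
either = build_params("title:(new OR house)")
# Same intent as the AND query, expressed through q.op
both_via_qop = build_params("title:(new house)", default_operator="AND")

print(urlencode(both_via_qop))
```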
Negation
● Exclude documents containing specific terms
Inverted Index—Revisited
● All terms in the index map to 1 or more documents.
● Terms in inverted index are stored in ascending
lexicographical order
● When searching for multiple terms or expressions, Solr (and
Lucene) retrieves a document set for each term in the query and
then applies the specified boolean operations to these sets to
generate the final result set.
● Scoring is then performed on that result set to generate the final
ranked results
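The set operations described above can be sketched with a toy inverted index. This is an illustration only, not how Lucene stores its index; the documents and the helper build_inverted_index are made up for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "new house for sale",
    2: "old house near the lake",
    3: "new car",
}
index = build_inverted_index(docs)

new_and_house = index["new"] & index["house"]  # AND: intersection
new_or_house = index["new"] | index["house"]   # OR: union
house_not_new = index["house"] - index["new"]  # NOT: set difference

# A real index keeps its term dictionary sorted; here we sort on demand
sorted_terms = sorted(index)
```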
Grouped Expressions
● Represent arbitrarily complex queries
Exact Phrase Queries
● Search for exact phrase “new house”
● Can Combine with Boolean Queries
Proximity Searches
● Search for terms that occur within a given distance of each other
● Solr/Lucene not only stores the documents that contain the
terms, but also their positions within a document (term
positions), which is used to provide phrase and proximity
search functionality
● The number after the “~” is called the slop factor: the maximum
number of position moves allowed for the terms to still match
● Unlike fuzzy edit distances, which Lucene caps at 2, the slop factor
has no hard limit, but larger values are more expensive to evaluate
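A small sketch of the phrase and proximity syntax, assuming the standard query parser. The helper phrase is hypothetical; it only builds the query string.

```python
def phrase(terms, slop=0):
    """Quote the terms as a phrase; append ~slop for a proximity search."""
    quoted = '"' + " ".join(terms) + '"'
    return quoted if slop == 0 else f"{quoted}~{slop}"

exact = phrase(["new", "house"])         # exact phrase
near = phrase(["new", "house"], slop=2)  # terms within 2 position moves
combined = f"{exact} AND lake"           # phrase combined with a boolean clause
```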
Fuzzy Matching
Fuzzy Edit-Distance Searching
● Flexibility to handle misspellings and different spellings of a
word
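● Character variations based on Damerau-Levenshtein distances
(insertions, deletions, substitutions, and transpositions), up
to a maximum edit distance of 2
● Such edits account for roughly 80% of human misspellings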
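To make the edit-distance idea concrete, here is a sketch of the optimal string alignment variant of Damerau-Levenshtein distance. It is a plain Python illustration, not Lucene's automaton-based implementation, and the function names are chosen for this example.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # adjacent characters swapped counts as a single edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def fuzzy_term(term, max_edits):
    """Build a fuzzy query term such as house~1."""
    return f"{term}~{max_edits}"
```

With this definition, "huose" is one edit away from "house" (a transposition), so a query for house~1 would be expected to match it.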
Wildcard Matching
● Robust functionality, but can be expensive if not properly
used.
○ First, all terms that start with the part of the expression before the
wildcard are extracted
○ Then each of those terms is checked against the entire wildcard
expression
○ Expensive if your expression matches a large number of terms (for
example the query e*)
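The two-step process above can be sketched over a sorted term list. This is a simplified illustration that handles only the * and ? wildcards; the term list and the helper wildcard_terms are made up for the example.

```python
from bisect import bisect_left
from fnmatch import fnmatchcase

def wildcard_terms(sorted_terms, pattern):
    """Return terms matching a wildcard pattern, scanning only the prefix range."""
    # Literal prefix before the first wildcard character
    cut = min((pattern.find(c) for c in "*?" if c in pattern), default=len(pattern))
    prefix = pattern[:cut]
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # past the prefix range in the sorted term dictionary
        if fnmatchcase(term, pattern):
            matches.append(term)
    return matches

terms = ["home", "hope", "horse", "hose", "house", "mouse"]
short_scan = wildcard_terms(terms, "ho?e")  # prefix "ho" narrows the scan
full_scan = wildcard_terms(terms, "*se")    # empty prefix: every term is inspected
```

A leading wildcard leaves no prefix to narrow the scan, which is why such patterns are the expensive case.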
Range Searches
Query on a Range
● Solr Date Time uses a format that is a restricted form of the canonical
representation of dateTime in the XML Schema specification (inspired by ISO
8601). All times are assumed to be UTC (no timezone specification)
● Based on a lexicographically sorted order for the field being queried
● Solr has Trie field types (tint, tdate, etc.) that should be used when you are
doing a large number of range queries
● Various field types will be covered later in the course
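A short sketch of range query syntax, assuming the standard query parser. Square brackets are inclusive bounds, curly braces are exclusive, and * leaves a side open. The field names price and created and the helper range_clause are assumptions.

```python
def range_clause(field, low, high, include_low=True, include_high=True):
    """Build a range clause; [ ] are inclusive bounds, { } are exclusive."""
    open_b = "[" if include_low else "{"
    close_b = "]" if include_high else "}"
    return f"{field}:{open_b}{low} TO {high}{close_b}"

price_range = range_clause("price", 100, 500)
open_ended = range_clause("price", 100, "*")
half_open = range_clause("price", 100, 500, include_high=False)
dated = range_clause("created", "2016-01-01T00:00:00Z", "2016-12-31T23:59:59Z")
```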
Solr Date Syntax
● Uses UTC and Restricted DateTime format
● Allows rounding down by YEAR, MONTH, DAY, HOUR,
MINUTE, SECOND
● NOW represents current time and using DateMath, we can
specify yesterday, tomorrow, last year, etc.
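A sketch of the date format and a few DateMath expressions. The Python helpers only emulate client-side what the server does with NOW/DAY; the field name created is an assumption.

```python
from datetime import datetime, timezone

def solr_date(dt):
    """Format an aware datetime in the restricted UTC form Solr expects."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def round_down_to_day(dt):
    """Client-side equivalent of rounding with /DAY: drop the time of day."""
    return dt.replace(hour=0, minute=0, second=0, microsecond=0)

now = datetime(2016, 5, 17, 14, 30, 45, tzinfo=timezone.utc)
formatted = solr_date(now)
start_of_day = solr_date(round_down_to_day(now))

# The same ideas expressed server-side with DateMath
today = "NOW/DAY"
yesterday = "NOW-1DAY/DAY"
tomorrow = "NOW+1DAY/DAY"
start_of_last_year = "NOW-1YEAR/YEAR"
last_7_days = "created:[NOW-7DAYS/DAY TO NOW]"
```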
Sorting
Sorting
● Sort by score
● Values of Fields
● Ascending or Descending
● Multiple Fields
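A minimal sketch of a multi-field sort. The sort parameter takes comma-separated "field direction" pairs; the field name price and the helper sort_param are assumptions.

```python
def sort_param(*clauses):
    """Join (field, direction) pairs into a sort parameter value."""
    for _, direction in clauses:
        if direction not in ("asc", "desc"):
            raise ValueError(f"bad sort direction: {direction}")
    return ", ".join(f"{field} {direction}" for field, direction in clauses)

# Best-scoring documents first; ties broken by the cheapest price
params = {"q": "house", "sort": sort_param(("score", "desc"), ("price", "asc"))}
```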
Pseudo Fields
● Dynamically added at query time and calculated from fields in
the schema using built-in functions
● Through functions, you can manipulate the values of any field
before it is returned
● Can also be used to modify the order of documents by sorting
on the pseudo field
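A sketch of a pseudo field in the fl parameter, using Solr's product function. The field name price, the alias discounted, and the 10% discount are assumptions. Note that the sort is written against the function itself rather than the alias.

```python
def pseudo_field(alias, function):
    """Name a function result so it is returned like a regular field."""
    return f"{alias}:{function}"

discount = "product(price,0.9)"  # hypothetical 10% discount on a price field
params = {
    "q": "house",
    "fl": ",".join(["id", "price", pseudo_field("discounted", discount)]),
    "sort": f"{discount} asc",  # order documents by the computed value
}
```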
Geospatial Searches
Geospatial Searches
● Solr provides location-based search
● Define a “location” field that contains latitude and longitude
● You can use a Query parser called “geofilt” to search on this
field, specifying the point and radius around it
● Another query parser, bbox, uses a bounding box around the point
to do faster but approximate calculations
● Other types of searches (grids, polygons, etc.) are possible
and are covered in the advanced course
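A sketch of the geofilt and bbox filters using local-params syntax. The field name location, the coordinates, and the 5 km radius are assumptions; sfield, pt, and d are the spatial parameters these parsers read.

```python
def spatial_filter(parser, field, lat, lon, distance_km):
    """Build a geofilt (circle) or bbox (bounding box) filter query."""
    if parser not in ("geofilt", "bbox"):
        raise ValueError(f"unsupported spatial parser: {parser}")
    return f"{{!{parser} sfield={field} pt={lat},{lon} d={distance_km}}}"

circle = spatial_filter("geofilt", "location", 31.9539, 35.9106, 5)
box = spatial_filter("bbox", "location", 31.9539, 35.9106, 5)
params = [("q", "*:*"), ("fq", circle)]
```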
Returning Calculated Distances
● You can use a pseudo field (a field that is calculated at query
time) to achieve this
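A sketch of returning the distance as a pseudo field with Solr's geodist() function. The field names location and name and the alias distance_km are assumptions. The haversine_km helper is a client-side illustration of a great-circle distance, not Solr's own code.

```python
from math import asin, cos, radians, sin, sqrt

params = {
    "q": "*:*",
    "sfield": "location",
    "pt": "31.9539,35.9106",
    "fl": "id,name,distance_km:geodist()",  # distance returned per document
    "sort": "geodist() asc",                # nearest first
}

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points, in kilometres."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)  # roughly 111 km
```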
Filter Queries
The fq and q Parameters
● Indistinguishable at first glance: the same query passed to either
parameter will return the same documents.
● But,
○ fq serves a single purpose, to limit what is returned
○ q limits what is returned AND supplies the relevancy algorithm with a set
of terms used for scoring
● fq results are cached and can be reused between searches
● Using fq we can avoid unnecessary relevancy calculations
● You can use multiple fq’s in a request (each individually cached), but only one q
parameter
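A sketch of one scored q combined with several fq filters. The field names are assumptions. A list of pairs is used because fq repeats, which a plain dict cannot express.

```python
from urllib.parse import parse_qs, urlencode

params = [
    ("q", "title:house"),         # matched and scored
    ("fq", "city:amman"),         # filter only, cached on its own
    ("fq", "price:[100 TO 500]"), # second filter, cached separately
]
query_string = urlencode(params)
decoded = parse_qs(query_string)
```

Keeping filters that recur across searches (city, price band) in fq lets their cached results be reused while q changes from request to request.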
Faceting and Stats
Faceted Search
● High-level breakdown of search
results based on one or more
aspects (facets) of their
documents
● Allows users to filter by (drill
down into) specific components
● Can facet on values of fields, or
facet by queries
Types of Facet
● Field Facets
● Range Facets
● Pivot Facets
Field Faceting
● Request back the unique values
found in a particular field
● Most commonly used
● Works for single- and multi-valued
fields
● Values are based on the indexed
values of the field
● Common practice is to facet on a
String field and search on a text field
(to be discussed later). So, some
schema preparation is required for
faceting
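A sketch of a field-facet request and of reading the counts back. The field names city and category are assumptions. The helper pair_counts assumes the flat JSON layout in which values and counts alternate in one list.

```python
params = [
    ("q", "*:*"),
    ("rows", "0"),               # only the facet counts are needed
    ("facet", "true"),
    ("facet.field", "city"),
    ("facet.field", "category"),
    ("facet.mincount", "1"),     # hide values with no matching documents
]

def pair_counts(flat):
    """Turn [value, count, value, count, ...] into a dict."""
    return dict(zip(flat[::2], flat[1::2]))

city_counts = pair_counts(["amman", 12, "irbid", 5])
```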
Range Facet
● Divide a range into equal size buckets
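A sketch of a range-facet request with equal-size buckets. The field name price and the bounds are assumptions; range_buckets only shows which lower bounds the gap produces.

```python
params = [
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.range", "price"),
    ("facet.range.start", "0"),
    ("facet.range.end", "1000"),
    ("facet.range.gap", "250"),
]

def range_buckets(start, end, gap):
    """Lower bounds of the equal-size buckets between start and end."""
    return list(range(start, end, gap))

buckets = range_buckets(0, 1000, 250)
```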
Range Faceting
Date Range Facets
● Recall Solr Date Syntax covered earlier in class
● Uses UTC and Restricted DateTime format
● Allows rounding down by YEAR, MONTH, DAY, HOUR,
MINUTE, SECOND
● NOW represents current time and using DateMath, we can
specify yesterday, tomorrow, last year, etc.
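A sketch of a date range facet with one bucket per month over the last twelve months. The field name created is an assumption. The gap starts with a plus sign, which has to be URL-encoded or it would be read as a space.

```python
from urllib.parse import parse_qs, urlencode

params = [
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.range", "created"),
    ("facet.range.start", "NOW/MONTH-11MONTHS"),
    ("facet.range.end", "NOW/MONTH+1MONTH"),
    ("facet.range.gap", "+1MONTH"),
]
query_string = urlencode(params)  # encodes "+" as %2B
decoded = parse_qs(query_string)
```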
Stats and Facets
● Can get aggregations on various fields
● From Solr 5.x onwards, stats on pivot facets are also available
● See https://lucidworks.com/blog/you-got-stats-in-my-facets/ for
a great explanation of faceting
Pivot Facets
● Functions like pivot tables in spreadsheet apps
● Aggregate calculations that pivot on values from multiple fields
● Example: give me a count of 3-, 4-, and 5-star hotels in the top
three cities
● Solr 5.x also allows you to run stats calculations on pivots
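A sketch of the hotel example as a pivot facet request, followed by a plain Python emulation of the nested counts. The field names city and stars, the sample hotels, and the helper pivot_counts are assumptions.

```python
from collections import Counter, defaultdict

params = [
    ("q", "*:*"),
    ("rows", "0"),
    ("fq", "stars:[3 TO 5]"),
    ("facet", "true"),
    ("facet.pivot", "city,stars"),  # count stars within each city
    ("facet.limit", "3"),           # top three values at each level
]

def pivot_counts(docs, outer, inner):
    """Count inner-field values grouped by outer-field value."""
    table = defaultdict(Counter)
    for doc in docs:
        table[doc[outer]][doc[inner]] += 1
    return table

hotels = [
    {"city": "Amman", "stars": 5},
    {"city": "Amman", "stars": 4},
    {"city": "Amman", "stars": 5},
    {"city": "Aqaba", "stars": 3},
]
table = pivot_counts(hotels, "city", "stars")
```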
Facet by Query
● Sometimes, you need unequal ranges
● You can use the facet.query parameter
● Provides counts for subqueries
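A sketch of unequal price bands with facet.query. The field name price and the band boundaries are assumptions. Each facet.query returns its own count.

```python
price_bands = [
    "price:[* TO 100}",    # under 100
    "price:[100 TO 500}",  # 100 up to, but not including, 500
    "price:[500 TO *]",    # 500 and above
]
params = [("q", "*:*"), ("rows", "0"), ("facet", "true")]
params += [("facet.query", band) for band in price_bands]
```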
Tuning Relevance
Precision and recall
Precision
Are the top results we show to users relevant?
Recall
Of the full set of documents found, have we found all of the
relevant content in the index?
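The two measures can be computed from a set of returned document ids and a set of ids judged relevant. A minimal sketch; the ids below are made up.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list and a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents returned, 3 relevant documents exist, 2 of them were returned
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 5])
```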
Relevancy
Our goal is to give users relevant results
Relevance is a soft or fuzzy thing
● Depends upon the judgment of users
Scoring is our attempt to predict relevance
Similarity classes hold the implementations
● DefaultSimilarity ( TF-IDF )
● BM25Similarity
● DFRSimilarity
● IBSimilarity
● LMDirichletSimilarity
● LMJelinekMercerSimilarity
Lucene Scoring
Similarity scoring formula
• Used to rank results by measuring the similarity between a query
and the documents that match the query
Domain knowledge
Examples
● Cheaper
● Newer or more recent
● More popular or higher user clicks
● Higher average user ratings
Interesting combinations
● Value = average user ratings ÷ price
● Staying power = recent popularity ÷ age
Boosting and biasing
Lucene uses a standardized scoring approach
Lucene does not know:
● Your data
● Your users
● Their queries
● Their preferences
Domain knowledge
What do you know about your data?
● Any specific rules about your data that wouldn't be suitable in
a generic IR scoring algorithm
● In many data domains, there are fundamental numeric
properties that make some objects generally "better" than
others
Domain knowledge
More subtle examples
● Novelty factor
○ Quantity of user ratings × stdDev of ratings
● Profit margin
○ Retail price ‒ factory cost
● Scarcity
○ Quantity remaining
● Popularity by association or categorization
○ Sweaters sell better than swimsuits in November
● Manual ranking
○ New York Times bestseller list
Request parameters
We are going to make substantial use of request parameters, so
let's recap:
How can you improve search results?
Using a sledgehammer
● Ignore score, sort on X
● Filter by X, retry if 0 results
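A sketch of the "filter by X, retry if 0 results" approach. The search argument stands in for whatever function sends parameters to Solr and returns documents; it and fake_search are hypothetical.

```python
def search_with_fallback(search, query, strict_filter):
    """Try the filtered search first; drop the filter if nothing comes back."""
    results = search({"q": query, "fq": strict_filter})
    if not results:
        results = search({"q": query})
    return results

def fake_search(params):
    """Stand-in for a real Solr call: the filter here never matches anything."""
    return [] if "fq" in params else ["doc1"]

results = search_with_fallback(fake_search, "house", "city:amman")
```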
How can you improve search results?
● Boost functions and queries
● Apply domain knowledge based on numeric properties by
multiplying functions directly into the score
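A sketch of a multiplicative boost, assuming the edismax query parser, whose boost parameter multiplies a function into the score. The field names rating, price, and created are assumptions. The Python helpers only emulate the arithmetic.

```python
params = [
    ("defType", "edismax"),
    ("q", "house"),
    ("qf", "title description"),
    ("boost", "div(rating,price)"),  # "value" = average rating / price
]

# A commonly used recency function: close to 1 for new documents,
# about 0.5 for documents that are one year old
recency = "recip(ms(NOW,created),3.16e-11,1,1)"

def recip(x, m, a, b):
    """Same shape as Solr's recip function: a / (m * x + b)."""
    return a / (m * x + b)

def boosted_score(text_score, rating, price):
    """Emulate multiplying the value function into the text score."""
    return text_score * (rating / price)

one_year_ms = 365 * 24 * 3600 * 1000
```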
