How the Lucene More Like This Works

Apache Lucene/Solr London User Group
Alessandro Benedetti, Software Engineer
16th May 2019

Who I am
▪ Search Consultant
▪ R&D Software Engineer
▪ Master in Computer Science
▪ Apache Lucene/Solr Enthusiast
▪ Passionate about Semantic, NLP and Machine Learning technologies
▪ Beach Volleyball Player & Snowboarder
Alessandro Benedetti

Sease
Search Services
● Open Source Enthusiasts
● Apache Lucene/Solr experts
● Community Contributors
● Active Researchers
● Hot Trends: Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning

Agenda
● Document Similarity
● Apache Lucene More Like This
● Term Scorer
● BM25
● Interesting Terms Retrieval
● Query Building
● DEMO
● Future Work
● JIRA References

Document Similarity
Problem: find documents similar to a seed document
Solution(s):
● Collaborative approach (user interactions)
● Content Based
● Hybrid
Similar?
● Documents accessed in similar manners by similar people
● Term distributions
● All of the above

Real World Use Cases - Streaming Services

Real World Use Cases - Hotels

Apache Lucene
Apache Lucene™ is a high-performance, full-featured text search engine library
written entirely in Java.
It is a technology suitable for nearly any application that requires full-text
search, especially cross-platform.
Apache Lucene is an open source project available for free download.

Apache Lucene
● Search Library (Java)
● Structured Documents
● Inverted Index
● Similarity Metrics (TF-IDF, BM25)
● Fast Search
● Support for advanced queries
● Relevancy tuning

Inverted Index
Indexing

More Like This - Break Up
[Diagram: Input Document → More Like This (Params, Interesting Terms Retriever, Term Scorer, Query Builder) → QUERY]

More Like This Params
Responsibility: define a set of parameters (and defaults) that affect the various components of the More Like This module
● Regulates MLT behavior
● Groups parameters specific to each component
● Javadoc documentation
● Default values
● Useful container for the various parameters to be passed

Term Scorer
Responsibility: assign a score to a term that measures how distinctive the term is for the input document
● Field Name
● Field Stats (Document Count)
● Term Stats (Document Frequency)
● Term Frequency
● TF-IDF -> tf * (log(numDocs / (docFreq + 1)) + 1)
● BM25
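A minimal sketch of the TF-IDF term score listed above, written in plain Python rather than against the Lucene API. It reads the slide formula as numDocs / (docFreq + 1), which is an assumption about the intended parenthesization; the function and variable names are hypothetical.

```python
import math


def tf_idf_term_score(term_freq: int, doc_freq: int, num_docs: int) -> float:
    """Score a term as tf * idf, with idf = log(numDocs / (docFreq + 1)) + 1.

    The (docFreq + 1) grouping is an assumed reading of the slide formula.
    """
    idf = math.log(num_docs / (doc_freq + 1)) + 1.0
    return term_freq * idf


# Same term frequency in the input document, different document frequency
# in the index: the rarer term is considered more distinctive.
rare = tf_idf_term_score(term_freq=3, doc_freq=9, num_docs=1000)
common = tf_idf_term_score(term_freq=3, doc_freq=499, num_docs=1000)
```

With equal term frequency, the term that appears in fewer indexed documents gets the higher score, which is what makes it a better candidate for the similarity query.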

BM25 Term Scorer
● Originates from Probabilistic Information Retrieval
● Default Similarity from Lucene 6.0 [1]
● 25th iteration in improving TF-IDF
● TF
● IDF
● Document Length
[1] LUCENE-6789

BM25 Term Scorer - Inverse Document Frequency
IDF Score has very similar behavior
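To make the comparison concrete, here is a small sketch that evaluates two IDF formulations side by side. The BM25 form, log(1 + (N - n + 0.5) / (n + 0.5)), is the one used by Lucene's BM25Similarity as far as I know; the classic form assumes the numDocs / (docFreq + 1) reading of the TF-IDF formula. The collection size and function names are hypothetical.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    # Assumed reading of the TF-IDF idf: log(numDocs / (docFreq + 1)) + 1
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def bm25_idf(doc_freq: int, num_docs: int) -> float:
    # BM25 idf: log(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


NUM_DOCS = 1000  # hypothetical collection size
doc_freqs = (1, 10, 100, 500, 999)
classic_curve = [classic_idf(df, NUM_DOCS) for df in doc_freqs]
bm25_curve = [bm25_idf(df, NUM_DOCS) for df in doc_freqs]
```

Both curves decrease as the document frequency grows, so rare terms are rewarded in a similar way by either formulation.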

BM25 Term Scorer - Term Frequency
TF Score asymptotically approaches (k+1); k=1.2 in this example
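The saturation can be sketched with the term frequency part of BM25, tf * (k + 1) / (tf + k). This simplified form assumes the document has exactly the average length, so the length normalization drops out; the function name is hypothetical.

```python
def bm25_tf(term_freq: float, k: float = 1.2) -> float:
    """BM25 term frequency component for a document of average length."""
    return term_freq * (k + 1.0) / (term_freq + k)


# The score grows quickly for the first occurrences and then flattens,
# never reaching the upper bound k + 1 = 2.2.
saturation = [bm25_tf(tf) for tf in (1, 2, 5, 10, 100, 1000)]
```

Unlike the raw term frequency in TF-IDF, repeating a term many times adds less and less to its score.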

BM25 Term Scorer - Document Length
Document Length / Avg Document Length affects how fast the TF score saturates
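A sketch of how the length ratio enters the TF component, using the standard BM25 normalization k * (1 - b + b * docLen / avgDocLen). The parameter b is not mentioned in the slides; b = 0.75 is an assumed, commonly used value, and the names are hypothetical.

```python
def bm25_tf_with_length(term_freq: float, doc_len: float, avg_doc_len: float,
                        k: float = 1.2, b: float = 0.75) -> float:
    """BM25 term frequency component with document length normalization."""
    norm = k * (1.0 - b + b * doc_len / avg_doc_len)
    return term_freq * (k + 1.0) / (term_freq + norm)


# Same term frequency, three document lengths (average length is 100).
short_doc = bm25_tf_with_length(3, doc_len=50, avg_doc_len=100)
average_doc = bm25_tf_with_length(3, doc_len=100, avg_doc_len=100)
long_doc = bm25_tf_with_length(3, doc_len=200, avg_doc_len=100)
```

For the same term frequency, a shorter-than-average document saturates faster (higher score), while a longer one needs more occurrences to reach the same score.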

Interesting Term Retriever
Responsibility: retrieve from the document a queue of weighted interesting terms
● Analyze content / Term Vector
● Skip Tokens
● Score Tokens
● Build Queue of Top Scored terms
Params Used
● Analyzer
● Max Num Token Parsed
● Min Term Frequency
● Min/Max Document Frequency
● Max Query Terms
● Query Time Field Boost
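The steps above can be sketched in plain Python as follows. This is an illustration of the flow (analyze, skip, score, keep the top terms), not the Lucene implementation: the whitespace "analyzer", the default values, the TF-IDF scorer and all names are assumptions, and field boosts are left out.

```python
import heapq
import math
from collections import Counter


def interesting_terms(text, doc_freqs, num_docs, *, max_num_tokens_parsed=5000,
                      min_term_freq=2, min_doc_freq=5, max_doc_freq=None,
                      max_query_terms=25):
    """Return up to max_query_terms (score, term) pairs, best first."""
    # 1. Analyze content: a trivial lowercase/whitespace analyzer stands in
    #    for a real Analyzer or a stored Term Vector.
    tokens = text.lower().split()[:max_num_tokens_parsed]
    term_freqs = Counter(tokens)

    scored = []
    for term, tf in term_freqs.items():
        df = doc_freqs.get(term, 0)
        # 2. Skip tokens that fail the frequency thresholds.
        if tf < min_term_freq or df < min_doc_freq:
            continue
        if max_doc_freq is not None and df > max_doc_freq:
            continue
        # 3. Score tokens (assumed TF-IDF form).
        idf = math.log(num_docs / (df + 1)) + 1.0
        scored.append((tf * idf, term))

    # 4. Build the queue of top scored terms.
    return heapq.nlargest(max_query_terms, scored)


index_doc_freqs = {"lucene": 10, "search": 400, "index": 300, "the": 1000, "rare": 1}
top_terms = interesting_terms(
    "Lucene search Lucene index the the the search Lucene rare",
    index_doc_freqs, num_docs=1000, max_doc_freq=900, max_query_terms=2)
```

Here "the" is dropped by the maximum document frequency, "index" and "rare" by the minimum term frequency, and the remaining terms are ranked by score.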

More Like This Query Builder
Params Used
● Term Boost Enabled
Scored terms: Field1:Term1 (3.0), Field2:Term2 (4.0), Field1:Term3 (4.5), Field1:Term4 (4.8), Field3:Term5 (7.5)
Q = Field1:Term1^3.0 Field2:Term2^4.0 Field1:Term3^4.5 Field1:Term4^4.8 Field3:Term5^7.5
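As a rough illustration, the query string in this example can be reproduced from the scored terms with a few lines of Python. Lucene builds query objects rather than strings, so this only mirrors the textual form shown on the slide; the function name and the tuple layout are hypothetical.

```python
def build_mlt_query(scored_terms, term_boost_enabled=True):
    """Render (field, term, score) triples as optional boosted clauses."""
    clauses = []
    for field, term, score in scored_terms:
        clause = f"{field}:{term}"
        if term_boost_enabled:
            # The boost is the term score assigned by the Term Scorer.
            clause += f"^{score}"
        clauses.append(clause)
    return " ".join(clauses)


scored_terms = [
    ("Field1", "Term1", 3.0),
    ("Field2", "Term2", 4.0),
    ("Field1", "Term3", 4.5),
    ("Field1", "Term4", 4.8),
    ("Field3", "Term5", 7.5),
]
boosted_query = build_mlt_query(scored_terms)
plain_query = build_mlt_query(scored_terms, term_boost_enabled=False)
```

With the term boost disabled, every clause contributes with the same weight and only the selection of terms carries the similarity signal.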

More Like This Boost
Term Boost
● on/off
● Affects each term weight in the MLT query
● It is the term score (it depends on the Term Scorer implementation chosen)
Field Boost
● field1^5.0 field2^2.0 field3^1.5
● Affects the Term Scorer
● Affects the interesting terms retrieved
N.B. a highly boosted field can dominate the interesting terms retrieval

More Like This Usage - Lucene Classification
● Given a document D to classify
● K Nearest Neighbours Classifier
● Find Top K similar documents to D (MLT)
● Classes are extracted
● Class Frequency + Class ranking -> Class probability
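A simplified sketch of the last two steps, assuming the top K similar documents have already been retrieved (for example through More Like This) and their classes extracted. It uses class frequency only; any weighting by rank or similarity score is left out, and the function name is hypothetical.

```python
from collections import Counter


def knn_class_probabilities(neighbour_classes):
    """Turn the classes of the top K neighbours into (class, probability)
    pairs, most frequent class first."""
    counts = Counter(neighbour_classes)
    total = sum(counts.values())
    return [(cls, count / total) for cls, count in counts.most_common()]


# Classes of the K = 4 documents most similar to D (made-up values).
probabilities = knn_class_probabilities(["comedy", "drama", "comedy", "comedy"])
assigned_class = probabilities[0][0]
```

The document D is then assigned the class with the highest estimated probability.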

More Like This Usage - Apache Solr
● More Like This query parser (can be concatenated with other queries)
● More Like This search component (can be assigned to a Request Handler)
● More Like This handler (handler with specific request parameters)
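As an example of the first option, the sketch below builds the request parameters for the More Like This query parser using Solr local-params syntax ({!mlt qf=... mintf=... mindf=...} followed by the seed document id). The host, the collection name "movies", the field name and the document id are all hypothetical.

```python
from urllib.parse import urlencode


def mlt_parser_params(doc_id: str, field: str, min_tf: int = 1, min_df: int = 1) -> dict:
    """Build the q parameter for Solr's More Like This query parser."""
    local_params = f"{{!mlt qf={field} mintf={min_tf} mindf={min_df}}}"
    return {"q": local_params + doc_id}


params = mlt_parser_params("1234", "title")
# Hypothetical Solr endpoint; adjust host and collection to the actual setup.
url = "http://localhost:8983/solr/movies/select?" + urlencode(params)
```

Since the parser produces a regular query, it can be combined with filters or other query clauses in the same request.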

More Like This Demo - Movie Data Set
This data consists of the following fields:
● id - unique identifier for the movie
● Title - Name of the movie
● Directors - The person(s) who directed the making of the film
● Genres - The genre(s) that the movie belongs to

More Like This Demo - Tuned
● Enable/Disable Term Boost
● Min Term Frequency
● Min Document Frequency
● Field Boost

Future Work
● Query Builder just uses Terms and Term Scores
● Term Positions?
● Phrase Queries Boost (for terms close in position)
● Sentence boundaries
● Field centric vs Document centric (should highly boosted fields kick out relevant terms from low boosted fields?)

Future Work - More Like These
● Multiple documents in input
● Interesting terms across documents
● Useful for Content Based recommender engines

More Like This
Pros
● Apache Lucene Module
● Advanced Params
● Input: structured document or just text
● Builds an advanced query
● Leverages the Inverted Index (and additional data structures)
Cons
● Massive single class
● Low cohesion
● Low readability
● Minimal test coverage
● Difficult to extend (and improve)

JIRA References
● LUCENE-7498 - Introducing BM25 Term Scorer
● LUCENE-7802 - Architectural Refactor
● LUCENE-8326 - MLT Params Refactor

Questions ?

Thanks!
