Apache Lucene/Solr London User Group
How the Lucene More Like This Works
Alessandro Benedetti, Software Engineer
16th May 2019

Who I am
Alessandro Benedetti
▪ Search Consultant
▪ R&D Software Engineer
▪ Master in Computer Science
▪ Apache Lucene/Solr Enthusiast
▪ Passionate about Semantic, NLP and Machine Learning technologies
▪ Beach Volleyball Player & Snowboarder

Sease
Search Services
● Open Source Enthusiasts
● Apache Lucene/Solr experts
● Community Contributors
● Active Researchers
● Hot Trends: Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning

Agenda
● Document Similarity
● Apache Lucene More Like This
● Term Scorer
● BM25
● Interesting Terms Retrieval
● Query Building
● DEMO
● Future Work
● JIRA References

Document Similarity
Problem: find documents similar to a seed document
Solution(s):
● Collaborative approach (user interactions)
● Content Based
● Hybrid
Similar?
● Documents accessed in similar ways by similar people
● Term distributions
● All of the above

Real World Use Cases - Streaming Services

Real World Use Cases - Hotels

Apache Lucene
Apache Lucene™ is a high-performance, full-featured text search engine library
written entirely in Java.
It is a technology suitable for nearly any application that requires full-text
search, especially cross-platform.
Apache Lucene is an open source project available for free download.

Apache Lucene
● Search Library (Java)
● Structured Documents
● Inverted Index
● Similarity Metrics (TF-IDF, BM25)
● Fast Search
● Support for advanced queries
● Relevancy tuning

Inverted Index
Indexing

More Like This - Break Up
[Diagram: Input Document + More Like This Params -> Interesting Terms Retriever (with Term Scorer) -> Query Builder -> QUERY]

More Like This Params
Responsibility: define a set of parameters (and defaults) that affect the various components of the More Like This module
● Regulates MLT behavior
● Groups parameters specific to each component
● Javadoc documentation
● Default values
● Useful container for the various parameters to be passed

Term Scorer
Responsibility: assign a score to a term that measures how distinctive the term is for the input document
● Field Name
● Field Stats (Document Count)
● Term Stats (Document Frequency)
● Term Frequency
● TF-IDF -> tf * (log(numDocs / (docFreq + 1)) + 1)
● BM25
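The TF-IDF formula on this slide can be tried out with a few lines of Java. This is only a minimal sketch: the class and method names are hypothetical (they are not the Lucene API), and it assumes the formula is meant as log(numDocs / (docFreq + 1)) + 1.

```java
// Illustrative sketch only: names are hypothetical, not the Lucene API.
final class TfIdfTermScorerSketch {

    /** IDF as read from the slide (assumed grouping): log(numDocs / (docFreq + 1)) + 1. */
    static double idf(long numDocs, long docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    /** Score of a term for the input document: tf * idf. */
    static double score(int termFreq, long numDocs, long docFreq) {
        return termFreq * idf(numDocs, docFreq);
    }

    public static void main(String[] args) {
        long numDocs = 1_000;
        // Both terms occur 3 times in the input document, but one is rare
        // in the index (9 documents) and the other is common (499 documents).
        System.out.println("rare term score:   " + score(3, numDocs, 9));
        System.out.println("common term score: " + score(3, numDocs, 499));
    }
}
```

With equal term frequency, the rarer term gets the higher score, which is what makes it "distinctive" for the input document.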

BM25 Term Scorer
● Originates from Probabilistic Information Retrieval
● Default Similarity since Lucene 6.0 [1]
● 25th iteration in improving TF-IDF
● TF
● IDF
● Document Length
[1] LUCENE-6789

BM25 Term Scorer - Inverse Document Frequency
IDF Score: has very similar behavior

BM25 Term Scorer - Term Frequency
TF Score: approaches (k+1) asymptotically (k=1.2 in this example)

BM25 Term Scorer - Document Length
Document Length / Avg Document Length: affects how fast we saturate the TF score
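The three BM25 ingredients (IDF, TF saturation, document length) can be sketched in Java as follows. This is a hedged illustration, not the code from the talk: the class name is hypothetical, the formulas are the textbook BM25 ones with the (k+1) numerator that the slide refers to, k=1.2 is taken from the slide, and b=0.75 is an assumption (the slide does not mention b).

```java
// Illustrative sketch only: names are hypothetical, not the Lucene API.
final class Bm25TermScorerSketch {

    static final double K1 = 1.2;  // value used on the slide
    static final double B = 0.75;  // assumption: not stated on the slide

    /** BM25 inverse document frequency. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** TF component: saturates towards (K1 + 1) as tf grows. */
    static double tfComponent(double tf, double docLength, double avgDocLength) {
        // Documents longer than average need more occurrences to reach the same score.
        double lengthNorm = 1.0 - B + B * (docLength / avgDocLength);
        return tf * (K1 + 1.0) / (tf + K1 * lengthNorm);
    }

    static double score(double tf, long docCount, long docFreq,
                        double docLength, double avgDocLength) {
        return idf(docCount, docFreq) * tfComponent(tf, docLength, avgDocLength);
    }

    public static void main(String[] args) {
        // TF component for a document of average length, then for one twice as long.
        for (int tf : new int[] {1, 2, 5, 50}) {
            System.out.println("tf=" + tf
                    + " avg-length doc: " + tfComponent(tf, 100, 100)
                    + " long doc: " + tfComponent(tf, 200, 100));
        }
    }
}
```

The TF component never exceeds K1 + 1, and for the same term frequency the longer document always gets the smaller value.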

Interesting Term Retriever
Responsibility: retrieve from the document a queue of weighted interesting terms
● Analyze content / Term Vector
● Skip Tokens
● Score Tokens
● Build Queue of Top Scored terms
Params Used
● Analyzer
● Max Num Token Parsed
● Min Term Frequency
● Min/Max Document Frequency
● Max Query Terms
● Query Time Field Boost
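The retrieval steps on this slide (analyze, skip, score, keep the top-scored terms) can be sketched in plain Java. Everything here is a simplification under stated assumptions: the names are hypothetical, a lowercase-and-split tokenizer stands in for the Analyzer, document frequencies come from a plain map rather than the index, the score is the simple TF-IDF variant, and the field boost is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative sketch only: names are hypothetical, not the Lucene API.
final class InterestingTermsSketch {

    static final class ScoredTerm {
        final String term;
        final double score;

        ScoredTerm(String term, double score) {
            this.term = term;
            this.score = score;
        }
    }

    static List<ScoredTerm> retrieve(String text, Map<String, Integer> docFreqs, int numDocs,
                                     int maxNumTokensParsed, int minTermFreq,
                                     int minDocFreq, int maxDocFreq, int maxQueryTerms) {
        // 1. Analyze content: a trivial stand-in for a real Analyzer.
        Map<String, Integer> termFreqs = new HashMap<>();
        int parsed = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (parsed >= maxNumTokensParsed) {
                break;
            }
            parsed++;
            termFreqs.merge(token, 1, Integer::sum);
        }

        // Min-heap on score: the weakest term is evicted first.
        Comparator<ScoredTerm> byScore = Comparator.comparingDouble(t -> t.score);
        PriorityQueue<ScoredTerm> queue = new PriorityQueue<>(byScore);

        for (Map.Entry<String, Integer> entry : termFreqs.entrySet()) {
            int tf = entry.getValue();
            int df = docFreqs.getOrDefault(entry.getKey(), 0);
            // 2. Skip tokens that are too rare in the document or too rare/common in the index.
            if (tf < minTermFreq || df < minDocFreq || df > maxDocFreq) {
                continue;
            }
            // 3. Score the token (simple TF-IDF here).
            double idf = Math.log((double) numDocs / (df + 1)) + 1.0;
            queue.add(new ScoredTerm(entry.getKey(), tf * idf));
            // 4. Keep only the top maxQueryTerms terms.
            if (queue.size() > maxQueryTerms) {
                queue.poll();
            }
        }

        List<ScoredTerm> top = new ArrayList<>(queue);
        top.sort(byScore.reversed());
        return top;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("the", 1000);
        docFreqs.put("search", 50);
        docFreqs.put("lucene", 10);
        String text = "the lucene search the lucene search search the";
        for (ScoredTerm t : retrieve(text, docFreqs, 1000, 5000, 2, 1, 500, 25)) {
            System.out.println(t.term + " " + t.score);
        }
    }
}
```

A bounded min-heap keeps memory proportional to Max Query Terms rather than to the number of distinct tokens in the document.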

More Like This Query Builder
Params Used
● Term Boost Enabled
Field   Term   Score
Field1  Term1  3.0
Field2  Term2  4.0
Field1  Term3  4.5
Field1  Term4  4.8
Field3  Term5  7.5
Q = Field1:Term1^3.0 Field2:Term2^4.0 Field1:Term3^4.5 Field1:Term4^4.8 Field3:Term5^7.5
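The query on this slide can be reproduced with a small Java sketch that turns scored (field, term) pairs into a query string. The names are hypothetical, and the sketch uses the raw term score as the boost, exactly as shown on the slide; how boosts are scaled in a real implementation is not covered here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only: names are hypothetical, not the Lucene API.
final class MltQueryBuilderSketch {

    static final class ScoredTerm {
        final String field;
        final String term;
        final double score;

        ScoredTerm(String field, String term, double score) {
            this.field = field;
            this.term = term;
            this.score = score;
        }
    }

    /** Builds one clause per interesting term; the boost is added only when enabled. */
    static String build(List<ScoredTerm> terms, boolean termBoostEnabled) {
        List<String> clauses = new ArrayList<>();
        for (ScoredTerm t : terms) {
            String clause = t.field + ":" + t.term;
            if (termBoostEnabled) {
                clause += "^" + String.format(Locale.ROOT, "%.1f", t.score);
            }
            clauses.add(clause);
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        List<ScoredTerm> terms = Arrays.asList(
                new ScoredTerm("Field1", "Term1", 3.0),
                new ScoredTerm("Field2", "Term2", 4.0),
                new ScoredTerm("Field1", "Term3", 4.5),
                new ScoredTerm("Field1", "Term4", 4.8),
                new ScoredTerm("Field3", "Term5", 7.5));
        System.out.println("Q = " + build(terms, true));
        System.out.println("Q = " + build(terms, false));
    }
}
```

With Term Boost disabled every clause carries the same weight, so the ranking of similar documents depends only on the similarity used at search time.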

More Like This Boost
Term Boost
● on/off
● Affects each term weight in the MLT query
● It is the term score (it depends on the Term Scorer implementation chosen)
Field Boost
● field1^5.0 field2^2.0 field3^1.5
● Affects the Term Scorer
● Affects the interesting terms retrieved
N.B. a highly boosted field can dominate the interesting terms retrieval

More Like This Usage - Lucene Classification
● Given a document D to classify
● K Nearest Neighbours Classifier
● Find the Top K documents similar to D (MLT)
● Classes are extracted
● Class Frequency + Class ranking -> Class probability
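The last step of the K Nearest Neighbours idea can be sketched in Java as below. This is a deliberately reduced illustration with hypothetical names: it assumes the classes of the top K similar documents have already been fetched (for example through an MLT query), and it derives the probability from class frequency only, ignoring the ranking of the neighbours that the slide also mentions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names are hypothetical, not the Lucene API.
final class KnnClassifierSketch {

    /**
     * Given the classes of the top K neighbours (in rank order),
     * returns each class with its relative frequency among them.
     */
    static Map<String, Double> classProbabilities(List<String> neighbourClasses) {
        Map<String, Double> counts = new LinkedHashMap<>();
        for (String cls : neighbourClasses) {
            counts.merge(cls, 1.0, Double::sum);
        }
        Map<String, Double> probabilities = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : counts.entrySet()) {
            probabilities.put(entry.getKey(), entry.getValue() / neighbourClasses.size());
        }
        return probabilities;
    }

    public static void main(String[] args) {
        // Classes of the 4 documents most similar to the document D to classify.
        List<String> neighbours = Arrays.asList("comedy", "drama", "comedy", "comedy");
        System.out.println(classProbabilities(neighbours));
    }
}
```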

More Like This Usage - Apache Solr
● More Like This query parser (can be concatenated with other queries)
● More Like This search component (can be assigned to a Request Handler)
● More Like This handler (handler with specific request parameters)

More Like This Demo - Movie Data Set
This data consists of the following fields:
● id - unique identifier for the movie
● Title - Name of the movie
● Directors - The person(s) who directed the making of the film
● Genres - The genre(s) that the movie belongs to

More Like This Demo - Tuned
● Enable/Disable Term Boost
● Min Term Frequency
● Min Document Frequency
● Field Boost

Future Work
● The Query Builder uses just Terms and Term Scores
● Term Positions?
● Phrase Queries Boost (for terms close in position)
● Sentence boundaries
● Field centric vs Document centric (should highly boosted fields kick out relevant terms from low boosted fields?)

Future Work - More Like These
● Multiple documents in input
● Interesting terms across documents
● Useful for Content Based recommender engines

More Like This
Pros
● Apache Lucene Module
● Advanced Params
● Input: structured document or just text
● Builds an advanced query
● Leverages the Inverted Index (and additional data structures)
Cons
● Massive single class
● Low cohesion
● Low readability
● Minimal test coverage
● Difficult to extend (and improve)

JIRA References
● LUCENE-7498 - Introducing BM25 Term Scorer
● LUCENE-7802 - Architectural Refactor
● LUCENE-8326 - MLT Params Refactor

Questions?

Thanks!
