Search explained T3DD15
My name is Hans Höchtl
Technical director @ Onedrop Solutions
PHP, Java, Ruby Developer
Participation in TYPO3 Solr
SELECT * FROM mytable WHERE
field LIKE '%searchword%'
SELECT * FROM mytable WHERE
field SOUNDS LIKE 'searchword'
Appearance of a word inside a
text can be determined easily.
But is it relevant?
Relevance is subjective and
depends on the judgement of
users.
We use „scoring“ to predict
relevance.
The score is computed by a function
that is applied to our indexed
documents, with the search term
as its input parameter.
- TF-IDF: Term frequency-inverse document frequency
- BM25: Okapi BM25 - Best Matching
- DFR: Divergence from randomness
- and many more
All those scoring calculations should
fulfill these two requirements:
1. Precision

Are the results relevant to the user?
2. Recall

Have we found all relevant content
in the index?
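The two requirements can be expressed as numbers for a single query. The following is a minimal sketch; the function name and the document ids are made up for illustration and do not come from the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: which share of the returned documents is relevant?
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: which share of all relevant documents was returned?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 of them relevant,
# while the index holds 6 relevant documents in total.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9})
print(precision, recall)
```

Returning fewer results tends to raise precision and lower recall, so the two values usually have to be balanced against each other.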
How to store documents for
efficient computing of scoring?
Vector Space Model

Default in Solr, Elasticsearch
Document: A vector of terms
Term: A „word“ inside a document
Each unique term is a dimension
Vector Space Model

The best match is the narrowest
angle between query and
document
Document 1: „unique unique bag“
Document 2: „unique bag bag“
Query: „unique bag“
[Figure: the vectors v(d1), v(q) and v(d2) drawn on the axes "unique" and "bag"]
The calculation of the cosine of
the angle between the vectors is
much easier than the calculation
of the angle itself. (CPU cycles)
cos(θ) = (d2 · q) / (||d2|| · ||q||)
where d2 · q is the intersection
(dot product) of the document and
the query vectors, and
||q|| is the norm (length) of the vector q.
A cosine value of zero means that
the query and document vectors
are orthogonal and have no
match.
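As a sketch of the idea, the cosine can be computed for the example with the two dimensions "unique" and "bag". It assumes that raw term counts are used as vector components; the function name is chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two term vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)


# Dimensions: (unique, bag), components are raw term counts (assumption)
d1 = (2, 1)  # "unique unique bag"
d2 = (1, 2)  # "unique bag bag"
q = (1, 1)   # "unique bag"

print(cosine(d1, q), cosine(d2, q))
print(cosine((1, 0), (0, 1)))  # orthogonal vectors
```

With raw counts both documents form the same angle with the query, so a pure count-based vector cannot rank one above the other here.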
TF-IDF
In the vector space model (VSM),
the weight of each term in the vector
of a document d is now the product of:
Term frequency
Inverse document frequency
TF-IDF
Now we have everything together to
calculate the similarity between
documents using TF-IDF:
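The formulas on these slides are not part of the transcript. The sketch below therefore uses a common textbook variant (raw term count multiplied by log(N / document frequency)), which is an assumption and not necessarily the exact formula shown in the talk.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # term frequency inside this document
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical mini corpus
docs = [
    ["unique", "unique", "bag"],
    ["unique", "bag", "bag"],
    ["bag", "shop"],
]
vectors = tf_idf_vectors(docs)
print(vectors[0])
```

A term that occurs in every document ("bag" in this corpus) gets the weight zero, because it does not help to tell the documents apart.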
TF-IDF
PROs:
- Simple model based on linear algebra
- Term weights are not binary
- Allows computing a continuous degree of similarity between queries and documents
- Allows ranking of documents according to their possible relevance
- Allows partial matching
CONs:
- Long documents have poor similarity values (small scalar product and large dimensionality)
- Search keywords must precisely match terms
- Missing semantic sensitivity
- Order of terms in the document is not taken into account
- Terms are usually not statistically independent (as this model assumes)
TF-IDF - The Lucene way
Coord: Boosts documents that
match more of the search terms
(multiple words) => 3/4 vs 4/4
Norm: Length normalization boosts
fields that are shorter
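Both factors can be sketched in a few lines. The length normalization 1 / sqrt(number of terms) is the one used by Lucene's classic TF-IDF similarity; Lucene additionally stores the value in a lossy one-byte encoding, which this sketch ignores. The function names are chosen for illustration.

```python
import math


def coord(matched_terms, query_terms):
    """Share of the query terms that a document matches."""
    return matched_terms / query_terms


def length_norm(num_terms):
    """Shorter fields get a larger normalization factor."""
    return 1.0 / math.sqrt(num_terms)


print(coord(3, 4), coord(4, 4))          # 3 of 4 terms vs. all 4 terms
print(length_norm(4), length_norm(100))  # short field vs. long field
```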
TF-IDF - Multiple fields
TF-IDF expects a document to be
just one field containing text. But in
reality we have semi-structured
documents containing fields like
author, subtitle, etc.
TF-IDF - Multiple fields
Solr solution: DisMax Query Parser (Disjunction
Max)
Searchterm: „my funny house“
- Documents matching query in field title
- Documents matching query in field subtitle
- Documents matching query in field content
TF-IDF calculated for every field independently.
Score of a document is the highest score of the field scoring values.
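A minimal sketch of this combination rule follows. The field scores are made-up numbers. The optional tie parameter mirrors the tie breaker of DisMax, which adds a fraction of the other field scores on top of the maximum; with the value 0 only the best field counts, as described above.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores: best field plus tie * (sum of the other fields)."""
    if not field_scores:
        return 0.0
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie * others


# Hypothetical per-field TF-IDF scores for one document
scores = {"title": 1.2, "subtitle": 0.4, "content": 0.7}

print(dismax_score(scores))           # pure maximum
print(dismax_score(scores, tie=0.1))  # maximum plus 10% of the other fields
```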
Natural languages
Adjectives, Adverbs, Nouns,
Verbs, Conjunctions, Prepositions,
Predicates, Compounds, Plurals,
Past tense, Declension,
Semantics, etc.
Language families
Indo-European languages
Sino-Tibetan languages
TF-IDF Problem
Only exact term matches are
considered a hit.
„Car“ is not the same term as
„Cars“
Handling human languages (Analyzers)
Tokenizers:

Split a stream of characters into a series of
tokens.
Filters:

The generated tokens are passed through a
series of filters that add, change or remove
tokens.
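The chain of one tokenizer followed by several filters can be sketched as below. This is a toy analyzer and not the Solr or Elasticsearch implementation; in particular the plural filter is a deliberately naive stand-in for a real stemmer.

```python
import re


def tokenize(text):
    """Tokenizer: split a stream of characters into word tokens."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Filter that changes tokens."""
    return [token.lower() for token in tokens]


def plural_filter(tokens):
    """Naive filter: strip a trailing "s" from tokens longer than 3 characters."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def analyze(text, filters=(lowercase_filter, plural_filter)):
    """Run the tokenizer, then pass the tokens through every filter in order."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("Car"), analyze("Cars"))
print(analyze("The red Cars"))
```

After analysis, "Car" and "Cars" end up as the same term and can therefore match each other.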
Index Analyzers vs. Query Analyzers
Index Analyzers:

Perform their analysis chain on the token stream
during indexing. The generated tokens will be
indexed.
Query Analyzers:

Perform their analysis chain on the entered search
query during query execution. Otherwise the query
would only hit exact matches.
Beware of Synonyms!
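Why the query needs its own analysis can be shown with a toy inverted index. The sketch assumes a very small analyzer (whitespace tokenizer plus lowercasing) and OR semantics between query terms; all names and documents are illustrative.

```python
from collections import defaultdict


def analyze(text):
    """Toy analysis chain: whitespace tokenizer plus lowercase filter."""
    return [token.lower() for token in text.split()]


def build_index(docs):
    """Index analyzer: the analyzed tokens are what ends up in the index."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search(index, query, analyze_query=True):
    """Query analyzer: optionally run the same chain on the search query."""
    terms = analyze(query) if analyze_query else query.split()
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits


docs = {1: "Sony Camera", 2: "Canon camera bag"}
index = build_index(docs)

print(search(index, "CAMERA"))                       # analyzed query
print(search(index, "CAMERA", analyze_query=False))  # raw query
```

Without query analysis the raw term "CAMERA" does not exist in the index, so nothing is found.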
Available analyzers
Solr (https://goo.gl/TXEjZK)

Language best practices (https://goo.gl/11O2Qz)
Elasticsearch (https://goo.gl/QR1IYb)

Language best practices (https://goo.gl/6FQt7A)
FieldTypes
Solr and Elasticsearch use
fieldTypes assigned to fields for
defining the analyzer chain that
should be performed
Let’s take a look at the
configuration of TYPO3 Solr and
Neos Elasticsearch
Let’s test the analyzer chain
Solr and Elasticsearch
Display score calculation
Solr: 

/solr/core_de/select?q=test&debugQuery=1
Elasticsearch: 

/_explain instead of /_search
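For Solr, such a request URL can be assembled as sketched below. The host and port are assumptions for a local installation; the core name and the parameters are the ones from the slide.

```python
from urllib.parse import urlencode


def solr_debug_url(base_url, core, query):
    """Build a select URL that asks Solr to include the score explanation."""
    params = urlencode({"q": query, "debugQuery": "1"})
    return f"{base_url}/solr/{core}/select?{params}"


# "http://localhost:8983" is an assumed local Solr address
url = solr_debug_url("http://localhost:8983", "core_de", "test")
print(url)
```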
Let’s take a look at
0.51602894 = (MATCH) sum of:
  0.51602894 = (MATCH) max of:
    0.51602894 = (MATCH) weight(content:sony^40.0 in 5) [DefaultSimilarity], result of:
      0.51602894 = fieldWeight in 5, product of:
        2.0 = tf(freq=4.0), with freq of:
          4.0 = termFreq=4.0
        3.3025851 = idf(docFreq=4, maxDocs=50)
        0.078125 = fieldNorm(doc=5)
    0.16512926 = (MATCH) weight(keywords:sony^2.0 in 5) [DefaultSimilarity], result of:
      0.16512926 = score(doc=5,freq=1.0 = termFreq=1.0), product of:
        0.05 = queryWeight, product of:
          2.0 = boost
          3.3025851 = idf(docFreq=4, maxDocs=50)
          0.0075698276 = queryNorm
        3.3025851 = fieldWeight in 5, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          3.3025851 = idf(docFreq=4, maxDocs=50)
          1.0 = fieldNorm(doc=5)
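The numbers in this explanation can be recomputed by hand. The sketch below assumes the formulas of Lucene's classic TF-IDF similarity, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the values in the output above are consistent with them. fieldNorm, boost and queryNorm are copied from the output and not derived here.

```python
import math


def tf(freq):
    """Term frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    """Inverse document frequency factor."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# content:sony -> fieldWeight = tf * idf * fieldNorm
content_weight = tf(4.0) * idf(4, 50) * 0.078125

# keywords:sony -> queryWeight * fieldWeight
query_weight = 2.0 * idf(4, 50) * 0.0075698276  # boost * idf * queryNorm
keywords_weight = query_weight * (tf(1.0) * idf(4, 50) * 1.0)

# "max of": the best field decides the score
score = max(content_weight, keywords_weight)
print(content_weight, keywords_weight, score)
```

The results differ from the explain output only in the last digits, because Lucene calculates with 32-bit floats.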
Product-Codes
„AS1134-B“
„131555813“
„EOS 500D“
„13 S24 36-G“
Product-Codes
Index the code in multiple fields to
have different analyzers and boost
them from strict to fuzzy.
Make use of N-Grams, EdgeNGrams,
WordDelimiter, Trim, etc.
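What n-grams and edge n-grams produce for a product code can be sketched as follows. This is a plain illustration of the two techniques and not the Solr or Elasticsearch filter implementation; the function names and size limits are chosen for the example.

```python
def ngrams(token, min_size, max_size):
    """All substrings of the token with a length between min_size and max_size."""
    return [
        token[start:start + size]
        for size in range(min_size, max_size + 1)
        for start in range(len(token) - size + 1)
    ]


def edge_ngrams(token, min_size, max_size):
    """Only the prefixes of the token, useful for search-as-you-type."""
    longest = min(max_size, len(token))
    return [token[:size] for size in range(min_size, longest + 1)]


print(edge_ngrams("as1134", 2, 4))
print(ngrams("eos", 2, 3))
```

Indexing these fragments in a separate, lower-boosted field lets a partial code such as "as11" still find the product, while the strictly analyzed field keeps exact matches on top.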
Use the knowledge you gain from
your customers to improve your
search, … like Google does.
- Use Google Analytics during index
time (preAddModifyDocuments hook)
- Use recency of news (boostfunction)
- Analyze the search behavior of your
customers (popularity of pages)
- Track search result clicks
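A recency boost can be sketched with a reciprocal function of the document age, modelled on the recip(x, m, a, b) = a / (m * x + b) function that Solr offers in function queries. The constants below are assumptions chosen so that a one year old document gets half the boost of a brand new one.

```python
MS_PER_YEAR = 365 * 24 * 60 * 60 * 1000


def recip(x, m, a, b):
    """Reciprocal function a / (m * x + b)."""
    return a / (m * x + b)


def recency_boost(age_ms):
    """Boost factor between 0 and 1 that decays with the document age."""
    return recip(age_ms, 1.0 / MS_PER_YEAR, 1.0, 1.0)


print(recency_boost(0))                # brand new document
print(recency_boost(MS_PER_YEAR))      # one year old
print(recency_boost(3 * MS_PER_YEAR))  # three years old
```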
Some more interesting things
- Facets
- Spellchecking
- Phonetics
- Spatial
Thank you
Mail: hhoechtl@1drop.de or jhoechtl@gmail.com

Twitter: @hhoechtl

Blog: http://blog.1drop.de
