Search explained T3DD15

My name is Hans Höchtl
Technical director @ Onedrop Solutions
PHP, Java, Ruby Developer
Participation in TYPO3 Solr

SELECT * FROM mytable WHERE
field LIKE '%searchword%'
SELECT * FROM mytable WHERE
field SOUNDS LIKE 'searchword'

Appearance of a word inside a
text can be determined easily.
But is it relevant?

Relevance is subjective and
depends on the judgement of
users.
We use „scoring“ to predict
relevance.

Scoring is computed by a function
applied on our indexed
documents using the search term
as input parameter.

TF-IDF  
Term frequency-inverse document
frequency
BM25 
Okapi BM25 - Best Matching
DFR 
Divergence from randomness
and many more

All those scoring calculations should
fulfill these two requirements:
1. Precision 
Are the results relevant to the user?
2. Recall 
Have we found all relevant content
in the index?
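A minimal sketch of both measures, assuming we already know which documents are relevant. The document IDs and the helper name `precision_recall` are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list and a set of relevant IDs."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    # Precision: share of the returned results that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of all relevant documents that were returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 results returned, 3 of them relevant,
# 6 relevant documents exist in the index.
p, r = precision_recall([1, 2, 3, 4], [2, 3, 4, 7, 8, 9])
print(p, r)
```

Returning more results tends to raise recall and lower precision, which is why a scoring function has to balance the two.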

How to store documents for
efficient computing of scoring?

Vector Space Model 
Default in Solr, Elasticsearch
Document: A vector of terms
Term: A „word“ inside a document
Each unique term is a dimension

Vector Space Model 
The best match is the narrowest
angle between query and
document

Document 1: „unique unique bag“
Document 2: „unique bag bag“
Query: „unique bag“
[Figure: the vectors v(d1), v(q) and v(d2) plotted on the axes „unique“ and „bag“]

The calculation of the cosine of
the angle between the vectors is
much easier than the calculation
of the angle itself. (CPU cycles)

cos(θ) = (d2 · q) / (||d2|| · ||q||)
Where d2 · q is the dot product
of the document and the query
vectors, and ||q|| is the norm
(length) of the vector q.

A cosine value of zero means that
the query and document vector
are orthogonal and have no
match.
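A small sketch of the cosine calculation, using the two example documents and the query from the slide above. It assumes plain term counts as vector components; the helper names `term_vector` and `cosine` are made up for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: term -> number of occurrences."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


d1 = term_vector("unique unique bag")
d2 = term_vector("unique bag bag")
q = term_vector("unique bag")

print(cosine(d1, q), cosine(d2, q))
# A document sharing no term with the query is orthogonal to it.
print(cosine(term_vector("house"), q))
```

With raw counts both documents sit at the same angle to the query, so they score equally. Term weighting is what separates them in practice.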

TF-IDF
In the vector space model (VSM),
the weight of each term in the vector
of a document d is the product of:
Term frequency: how often the term occurs in d
Inverse document frequency: how rare the term is across all indexed documents

TF-IDF
Now we have everything together to
calculate the similarity between
documents using TF-IDF:
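A hedged sketch of the whole pipeline. It assumes one common textbook variant (raw term count as tf, idf(t) = log(N / df(t))); the slide's exact formula is not in the text, and search engines use refined versions. The three sample documents are made up.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def idf_table(docs):
    """idf(t) = log(N / df(t)), where df(t) counts documents containing t."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(tokenize(doc)))
    return {t: math.log(n / count) for t, count in df.items()}


def tfidf_vector(text, idf):
    """Weight of term t = tf(t, text) * idf(t); unknown terms are dropped."""
    tf = Counter(tokenize(text))
    return {t: freq * idf[t] for t, freq in tf.items() if t in idf}


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


docs = ["the red car", "the blue car", "the green house"]  # made-up corpus
idf = idf_table(docs)
query = tfidf_vector("green car", idf)
ranking = sorted(docs, key=lambda d: cosine(tfidf_vector(d, idf), query),
                 reverse=True)
print(ranking)
```

In this corpus "the" occurs in every document, so its idf is zero and it contributes nothing to the score, while the rare term "green" dominates the ranking.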

TF-IDF
PROs
- Simple model based on linear algebra
- Term weights not binary
- Allows computing a continuous degree of similarity between queries and documents
- Allows ranking of documents according to their possible relevance
- Allows partial matching
CONs
- Long documents have poor similarity values (small scalar product and large dimensionality)
- Search keywords must precisely match terms
- Missing semantic sensitivity
- Order of terms in document not taken into account
- Terms are usually not statistically independent (as this model assumes)

TF-IDF - The Lucene way
Coord: Boosts documents that
match more of the search terms
(multiple words) => 3/4 vs 4/4
Norm: Length normalization boosts
fields that are shorter
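A sketch of these two factors, assuming the formulas of Lucene's classic TF-IDF similarity (coord = matched terms / query terms, length norm = 1 / sqrt(number of terms in the field)). Lucene additionally stores the norm in a lossy single byte, which is not modelled here.

```python
import math


def coord(overlap, max_overlap):
    """Fraction of the query terms that the document matches."""
    return overlap / max_overlap


def length_norm(num_terms):
    """Shorter fields get a larger factor: 1 / sqrt(number of terms)."""
    return 1.0 / math.sqrt(num_terms)


# 3 of 4 query terms matched vs. all 4 matched.
print(coord(3, 4), coord(4, 4))
# A 4-term title vs. a 100-term body field.
print(length_norm(4), length_norm(100))
```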

TF-IDF - Multiple fields
TF-IDF expects a document to be
just one field containing text. But in
reality we have semi-structured
documents containing fields like
author, subtitle, etc.

TF-IDF - Multiple fields
Solr Solution: DisMax Query Parser (Disjunction
Max)
Searchterm: „my funny house“
[Figure: three sets of documents, matching the query in field title, in field subtitle and in field content]
TF-IDF is calculated for every field independently.
The score of a document is the highest of its field scores.
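A minimal sketch of the disjunction-max idea. The per-field scores are made-up numbers, and `dismax_score` is a hypothetical helper, not Solr code. The `tie` argument mirrors DisMax's tie-breaker parameter, which lets the non-winning fields contribute a fraction of their score.

```python
def dismax_score(field_scores, tie=0.0):
    """Highest field score, plus `tie` times the sum of the remaining ones."""
    if not field_scores:
        return 0.0
    scores = sorted(field_scores.values(), reverse=True)
    return scores[0] + tie * sum(scores[1:])


# Hypothetical per-field TF-IDF scores of one document.
scores = {"title": 0.8, "subtitle": 0.3, "content": 0.5}

print(dismax_score(scores))           # plain maximum
print(dismax_score(scores, tie=0.1))  # other fields contribute a little
```

With `tie=0.0` only the best field counts; with `tie=1.0` the result would be the plain sum of all field scores.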

Natural languages
Adjectives, Adverbs, Nouns,
Verbs, Conjunctions, Prepositions,
Predicates, Compounds, Plurals,
Past tense, Declension,
Semantics, etc.

Language families
Indo-European languages
Sino-Tibetan languages

TF-IDF Problem
Only exact term matches are
considered a hit.
„Car“ is not the same term as
„Cars“

Handling human languages (Analyzers)
Tokenizers: 
Split a stream of characters into a series of
tokens.
Filters: 
The generated tokens are passed through a
series of filters that add, change or remove
tokens.
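A toy analyzer chain to illustrate the idea: one tokenizer followed by a series of filters. All function names are made up, the tokenizer handles ASCII only, and the plural filter is deliberately naive; real analyzers in Solr and Elasticsearch are far more careful.

```python
import re


def tokenizer(text):
    """Split a character stream into tokens on non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]


def lowercase_filter(tokens):
    """Change tokens: fold everything to lower case."""
    return [t.lower() for t in tokens]


def stop_filter(tokens, stopwords=frozenset({"the", "a", "of"})):
    """Remove tokens: drop very common words."""
    return [t for t in tokens if t not in stopwords]


def plural_filter(tokens):
    """Naive stemmer: strip one trailing 's' from tokens longer than 3 chars."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def analyze(text, filters=(lowercase_filter, stop_filter, plural_filter)):
    tokens = tokenizer(text)
    for f in filters:
        tokens = f(tokens)
    return tokens


print(analyze("The Cars of Sony"))
print(analyze("car"))
```

After analysis „Cars“ and „car“ produce the same term, so a query for one can hit documents containing the other.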

Index Analyzers vs. Query Analyzers
Index Analyzers: 
Perform their analysis chain on the token stream
during indexing. The generated tokens will be
indexed.
Query Analyzers: 
Perform their analysis chain on the entered search
query during query execution. Otherwise the query
would only hit exact matches.
Beware of Synonyms!

Available analyzers
Solr (https://goo.gl/TXEjZK) 
Language best practices (https://goo.gl/11O2Qz)
Elasticsearch (https://goo.gl/QR1IYb) 
Language best practices (https://goo.gl/6FQt7A)

FieldTypes
Solr and Elasticsearch use
fieldTypes assigned to fields for
defining the analyzer chain that
should be performed

Let’s take a look in the
configuration of TYPO3 Solr and
Neos Elasticsearch

Let’s test the analyzer chain
Solr and Elasticsearch

Display score calculation
Solr:  
/solr/core_de/select?q=test&debugQuery=1
Elasticsearch:  
/_explain instead of /_search

Let’s take a look at
0.51602894 = (MATCH) sum of:
  0.51602894 = (MATCH) max of:
    0.51602894 = (MATCH) weight(content:sony^40.0 in 5) [DefaultSimilarity], result of:
      0.51602894 = fieldWeight in 5, product of:
        2.0 = tf(freq=4.0), with freq of:
          4.0 = termFreq=4.0
        3.3025851 = idf(docFreq=4, maxDocs=50)
        0.078125 = fieldNorm(doc=5)
    0.16512926 = (MATCH) weight(keywords:sony^2.0 in 5) [DefaultSimilarity], result of:
      0.16512926 = score(doc=5,freq=1.0 = termFreq=1.0), product of:
        0.05 = queryWeight, product of:
          2.0 = boost
          0.0075698276 = queryNorm
        3.3025851 = fieldWeight in 5, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          1.0 = fieldNorm(doc=5)
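The tf, idf and fieldWeight values for the content field can be reproduced by hand. This sketch assumes the formulas of Lucene's classic DefaultSimilarity (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))). The fieldNorm is copied from the output, because it is stored per document at index time.

```python
import math


def tf(freq):
    """Term frequency factor: square root of the raw count."""
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


field_norm = 0.078125  # taken from the explain output above
field_weight = tf(4.0) * idf(4, 50) * field_norm

print(tf(4.0))        # compare with "2.0 = tf(freq=4.0)"
print(idf(4, 50))     # compare with "3.3025851 = idf(docFreq=4, maxDocs=50)"
print(field_weight)   # compare with "0.51602894 = fieldWeight in 5"
```

Small differences in the last digits are expected, since Lucene computes these values with 32-bit floats.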

Product-Codes
„AS1134-B“
„131555813“
„EOS 500D“
„13 S24 36-G“

Product-Codes
Index the code in multiple fields to
have different analyzers and boost
them from strict to fuzzy.
Make use of N-Grams, EdgeN-Grams,
WordDelimiter, Trim, etc.
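A rough sketch of what N-Gram and EdgeN-Gram analysis produce for a product code. The helpers are made up, and `normalize` is only a crude stand-in for a Trim plus word-delimiter step; the real Solr and Elasticsearch filters have many more options.

```python
def ngrams(text, min_size, max_size):
    """All substrings of length min_size..max_size."""
    return [text[i:i + n]
            for n in range(min_size, max_size + 1)
            for i in range(len(text) - n + 1)]


def edge_ngrams(text, min_size, max_size):
    """Prefixes of length min_size..max_size (front edge only)."""
    return [text[:n] for n in range(min_size, min(max_size, len(text)) + 1)]


def normalize(code):
    """Trim, lowercase and drop delimiters such as '-' and spaces."""
    return "".join(ch for ch in code.strip().lower() if ch.isalnum())


code = normalize(" AS1134-B ")
print(code)
print(edge_ngrams(code, 2, 5))  # matches while the user is still typing
print(ngrams(code, 3, 3))       # matches fragments anywhere in the code
```

One possible setup is to index the exact code, the edge n-grams and the n-grams in three separate fields and give them decreasing boosts, so that strict matches rank above fuzzy ones.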

Use the knowledge you gain from
your customers to improve your
search, … like Google does.

- Use Google Analytics during index
time (preAddModifyDocuments hook)
- Use recency of news (boostfunction)
- Analyze the search behavior of your
customers (popularity of pages)
- Track search result clicks
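To illustrate the recency boost, here is a sketch of the reciprocal function that Solr offers for function queries, recip(x, m, a, b) = a / (m * x + b), applied to the age of a document in milliseconds. The constant 3.16e-11 is roughly 1 divided by the number of milliseconds in a year; the dates and the helper `recency_boost` are made up.

```python
from datetime import datetime, timedelta, timezone


def recip(x, m, a, b):
    """Reciprocal function a / (m * x + b)."""
    return a / (m * x + b)


def recency_boost(published, now, m=3.16e-11, a=1.0, b=1.0):
    """Boost of 1.0 for brand-new documents, about 0.5 after one year."""
    age_ms = (now - published).total_seconds() * 1000.0
    return recip(age_ms, m, a, b)


now = datetime(2015, 7, 1, tzinfo=timezone.utc)  # hypothetical "now"
for days in (0, 30, 365):
    boost = recency_boost(now - timedelta(days=days), now)
    print(days, round(boost, 3))
```

Multiplying the text score by such a factor lets fresh news outrank older pages with a similar TF-IDF score.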

Some more interesting things
- Facets
- Spellchecking
- Phonetics
- Spatial

Thank you
Mail: hhoechtl@1drop.de or jhoechtl@gmail.com 
Twitter: @hhoechtl 
Blog: http://blog.1drop.de
