From the course: Advanced PostgreSQL

Full-text search

It's quite often the case that the data you store in your database tables contains text, and you want to search that text in a natural and expressive manner. That's exactly what full-text search enables. Full-text search identifies natural language documents that satisfy a query. The term natural language implies that queries can identify the meaning of a term: the search is not looking for the literal presence of the term, but for any word that has the same meaning or the same root form. The documents returned in the search results are sorted by how relevant they are to the query you used to perform the search. In its simplest form, full-text search finds all documents containing text similar to the query terms you have specified.

It has been possible for many years to perform text search on database columns using operators such as LIKE and ILIKE. But these operators are rather limiting because they offer no linguistic support. They do not recognize derived words that share the same root. For example, the terms amaze and amazing have the same linguistic root, but the LIKE and ILIKE operators will not find amaze if only amazing is present in your document. These operators only return documents where the search terms appear verbatim. There is no ordering or ranking of the search results to give you the most relevant documents first. Also, there is no underlying index of which documents contain which terms, so the search is slow: every document has to be scanned to see whether the terms exist in it.

To be able to perform full-text search, the documents you want to search need to be pre-processed and indexed. First, the text in the documents is tokenized, that is, parsed into tokens. Tokenization extracts the individual terms that exist in your text. The tokens in the document are then converted to lexemes.
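As a quick sketch of the difference, assuming a hypothetical `articles` table with a `body` text column, the LIKE operator misses derived forms while full-text search matches on the shared root:

```sql
-- Assumes a hypothetical table: articles(body text).

-- LIKE only matches the literal substring 'amaze', so a row whose
-- body contains only "amazing" is missed.
SELECT body FROM articles WHERE body LIKE '%amaze%';

-- Full-text search normalizes both the document and the query to
-- lexemes, so searching for 'amaze' also finds "amazing", "amazed", etc.
SELECT body
FROM articles
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'amaze');
```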
Lexemes represent a normalized, or root, form of the original words. For example, the lexeme for amaze and amazing is the same. The documents are then stored in a form optimized for search. For example, an index structure records which lexemes are present in each individual document, so that lookup by lexeme is very fast.

Let's talk about the first of these steps in more detail: parsing documents into tokens. This is called tokenization, and it is performed by tools known as parsers, which can be configured in PostgreSQL. Parsers identify all of the distinct terms that exist in your document. They can recognize numbers, words, email addresses, and other complex tokens, and each kind of token is processed differently. Postgres ships with a standard built-in parser, and you can also develop your own custom parsers.

Next, let's understand what it means to convert tokens to lexemes. A lexeme captures the linguistic root of a particular word: lexemes are just normalized strings, so that different forms of the same word become alike. Converting to lexemes typically involves lowercasing every word, removing suffixes where they exist, and eliminating stop words from your text. Lexemes are extremely useful for indexing and searching documents, and this conversion is performed in Postgres by components known as dictionaries.

To perform full-text search, your pre-processed documents need to be stored in an optimized form. A document is represented as an array of lexemes, and this array is optimized for search operations. In addition to the lexemes that make up a document, additional information is stored about every lexeme, such as positional information, which is used for proximity ranking of search results.
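You can watch both steps happen directly in psql: `to_tsvector` applies the parser and dictionaries in one call, and `ts_debug` shows how the parser classifies each token before the dictionaries normalize it. (The example strings here are made up for illustration.)

```sql
-- Tokens are normalized to lexemes: stop words ("The", "were") are
-- dropped, and each lexeme keeps the positions where it occurred.
SELECT to_tsvector('english', 'The amazing cats were amazed');
-- result: 'amaz':2,5 'cat':3

-- ts_debug reports the token type the parser assigned to each term
-- (e.g. plain words vs. an email address) and the resulting lexemes.
SELECT alias, token, lexemes
FROM ts_debug('english', 'Email info@example.com about PostgreSQL');
```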
Position matters in ranking your search results: documents that have a dense region of query words tend to rank higher. For example, if your query contains multiple words and those words occur close together in a single document, that document will be ranked higher than a document where the same terms occur but are spread far apart. It's important to note that the term document refers to the unit of searching in a full-text search system; it is the text that you are actually searching. Before we get to actual full-text search, let's quickly discuss the two types that form the main building blocks for full-text search in PostgreSQL. The first is the tsvector, which essentially represents a searchable document; only documents represented in the tsvector format can be searched. The second is the tsquery, which represents a search query to run against a document in order to perform full-text search.
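Putting the two building blocks together, a sketch of a match-and-rank query against the same hypothetical `articles(body text)` table might look like this; `ts_rank_cd` is the cover-density ranking function, which rewards query terms that appear close together:

```sql
-- The @@ operator tests whether a tsquery matches a tsvector.
SELECT to_tsvector('english', 'An amazing search engine')
       @@ to_tsquery('english', 'amaze & search');
-- result: true

-- Hypothetical articles(body text) table: find matches, then order
-- by relevance. ts_rank_cd uses lexeme positions, so documents with
-- the query terms close together rank higher (proximity ranking).
SELECT body,
       ts_rank_cd(to_tsvector('english', body),
                  to_tsquery('english', 'amaze & search')) AS rank
FROM articles
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'amaze & search')
ORDER BY rank DESC;

-- A GIN index over the computed tsvector lets matching use the index
-- instead of rescanning every row.
CREATE INDEX articles_body_fts_idx
    ON articles USING GIN (to_tsvector('english', body));
```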
