Elasticsearch Analyzers: Field-Level Optimization
Elasticsearch analyzers are a fundamental aspect of text processing, shaping how
data is indexed and searched within the system. In addition to the default analyzer,
Elasticsearch offers a range of specialized analyzers tailored to specific needs. In this
blog post, we will delve into analyzers such as Keyword, Language, Pattern, Simple,
Standard, Stop, and Whitespace. Understanding when to use each analyzer will
empower you to optimize your Elasticsearch setup for diverse scenarios.
What are Elasticsearch Analyzers?
Elasticsearch analyzers are a critical component of the Elasticsearch search engine; they process and index text data so that search operations are fast and accurate. The three primary components of an analyzer are character filters, tokenizers, and token filters.
Character filters preprocess the raw text, tokenizers split it into individual tokens, and token filters modify or remove those tokens. Using analyzers, Elasticsearch handles tasks such as stemming (reducing words to their root form), lowercasing, and removing stop words, all of which improve the quality of search results.
Elasticsearch ships with default analyzers for a variety of languages, and users can also build custom analyzers to meet specific indexing and search needs. Configuring analyzers well is critical for optimizing search functionality in Elasticsearch and increasing the relevance of search results.
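To make these three components concrete, here is a minimal sketch of a custom analyzer that chains all three stages together. The index and analyzer names (anatomy_example, my_analyzer) are illustrative; the building blocks (html_strip, standard, lowercase, stop) are all built into Elasticsearch:
PUT anatomy_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}
Here html_strip is the character filter (it cleans the raw text), standard is the tokenizer (it splits the text into tokens), and lowercase and stop are token filters applied to each token in order.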
Must Read: Explore Elasticsearch and Why It’s Worth Using?
What are the key features of Elasticsearch Analyzers?
• Tokenization: Elasticsearch analyzers break text down into tokens, the smallest meaningful units. This process is essential for efficient search operations.
• Character Filtering: Character filters in analyzers preprocess input text by
applying transformations or substitutions to characters before tokenization,
allowing for cleaner and standardized data.
• Token Filtering: After tokenization, analyzers employ token filters to modify or filter tokens. This step includes actions like stemming, lowercasing, and removing stop words to improve the relevance of search results.
• Multilingual Support: Elasticsearch analyzers are designed to handle diverse
language datasets, providing support for multilingual text analysis and
indexing.
• Default Analyzers: Elasticsearch comes with default analyzers for various
languages, offering convenient out-of-the-box solutions for common
scenarios.
• Index-Time and Query-Time Analysis: Analyzers operate at both index time and query time. This dual functionality allows for flexibility in how text is processed during data indexing and user search queries (see the sketch after this list).
• Stemming: Analyzers support stemming, the process of reducing words to
their root form, which enhances the inclusiveness of search results by
capturing variations of a word.
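As a quick sketch of index-time versus query-time analysis, a text field can declare a different analyzer for each phase via the search_analyzer mapping parameter (the index and field names here are hypothetical):
PUT analysis_phases_example
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "simple"
      }
    }
  }
}
With this mapping, documents are indexed using the standard analyzer, while query text is processed with the simple analyzer at search time.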
Explore firsthand the functionality of Elasticsearch analyzers through practical code
demonstrations. These examples serve as a gateway to understanding the inner
workings of analyzers, showcasing how they facilitate efficient indexing and powerful
search capabilities within Elasticsearch. Mastering these analyzers not only aids in
refining Elasticsearch queries but also enhances overall indexing strategies for
optimal performance.
1. Simple analyzer
The simple analyzer breaks text into tokens at any non-letter character (numbers, spaces, hyphens, apostrophes), discards the non-letter characters, and lowercases the result.
The simple analyzer consists of a single tokenizer: the lowercase tokenizer.
Example
POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Tokens generated
[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
Use Case: Basic Tokenization
Scenario: In situations where a simple tokenization approach is sufficient, such as
when dealing with less structured or informal text, the simple analyzer provides a
straightforward solution without extensive filtering.
Mapping (index name is illustrative):
PUT simple_example
{
  "mappings": {
    "properties": {
      "text_field": {
        "type": "text",
        "analyzer": "simple"
      }
    }
  }
}
2. Standard analyzer
The standard analyzer is the default analyzer, used if none is specified. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.
Example
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Tokens generated
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
Use Case: Common English Words Inclusion
Scenario: Use the standard analyzer when you want to index and search for common
words while maintaining tokenization and lowercase conversion.
Mapping:
PUT standard_example
{
  "mappings": {
    "properties": {
      "text_field": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}
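The standard analyzer also accepts optional parameters. As a sketch (index and analyzer names are illustrative), you can cap token length and enable a stop word list, both documented options of the standard analyzer:
PUT standard_configured
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}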
3. Keyword analyzer
The keyword analyzer is a “noop” analyzer that returns the entire input string as a
single token.
Example
POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Token generated
[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]
Use Case: Exact Match Searches
Scenario: You have identifiers like product codes, document IDs, or tags that should
not be tokenized. The keyword analyzer is suitable for scenarios where you need to
search for exact matches without breaking down the input into individual words.
Mapping:
PUT keyword_example
{
  "mappings": {
    "properties": {
      "keyword_field": {
        "type": "keyword"
      }
    }
  }
}
Note that a keyword field is never analyzed, so it does not accept the analyzer parameter; to apply the keyword analyzer explicitly, map the field as text with "analyzer": "keyword".
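With such a mapping in place, exact-match lookups are typically done with a term query, which does not analyze its input. A sketch against the index above, using a made-up product code:
GET keyword_example/_search
{
  "query": {
    "term": {
      "keyword_field": "PROD-2024-001"
    }
  }
}
The query matches only documents whose keyword_field holds exactly PROD-2024-001, including its case and hyphens.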
4. Whitespace analyzer
The whitespace analyzer breaks text into terms whenever it encounters a whitespace
character.
Example
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Tokens generated
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
Use Case: Maintain Text Structure
Scenario: Your data has distinct terms separated by whitespace, and you want to
preserve this structure. The whitespace analyzer tokenizes the input based on
whitespace characters, allowing you to index and search for terms as they appear in
the original text.
Mapping:
PUT whitespace_example
{
  "mappings": {
    "properties": {
      "text_field": {
        "type": "text",
        "analyzer": "whitespace"
      }
    }
  }
}
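Because the whitespace analyzer preserves case and punctuation, searches must match the tokens as stored. As a sketch against the index above, a match query for Brown-Foxes succeeds, while brown-foxes would not, since no lowercasing happens at index or query time:
GET whitespace_example/_search
{
  "query": {
    "match": {
      "text_field": "Brown-Foxes"
    }
  }
}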
5. Pattern analyzer
The pattern analyzer uses a regular expression to split the text into terms. The regular expression should match the token separators, not the tokens themselves. It defaults to \W+ (all non-word characters).
Example
POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Tokens generated
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
Use Case: Custom Text Formats
Scenario: You have structured data with specific patterns or custom text formats
that need specialized parsing. The pattern analyzer allows you to define regular
expressions for tokenization, making it suitable for scenarios where a predefined
structure exists. Examples: emails, phone numbers, dates, etc.
Mapping: a custom separator pattern must be configured as a custom analyzer in the index settings; it cannot be passed as a field-level parameter in the mapping:
PUT pattern_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": "\\s*,\\s*" // Example: tokenize by commas with optional spaces
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "custom_field": {
        "type": "text",
        "analyzer": "comma_analyzer"
      }
    }
  }
}
6. Stop analyzer
The stop analyzer is the same as the simple analyzer but adds support for removing stop words. It defaults to the _english_ stop word list; common English stop words include is, on, the, a, and an.
Example
POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Tokens generated
[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
Use Case: Lightweight Tokenization with Stop Word Removal
Scenario: You want the simple analyzer's lightweight, lowercase tokenization but also need to exclude common stop words. The stop analyzer filters out frequently occurring words that add little value to search results, as the token output above shows.
Mapping:
PUT stop_example
{
  "mappings": {
    "properties": {
      "text_field": {
        "type": "text",
        "analyzer": "stop"
      }
    }
  }
}
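If the default _english_ list does not fit your data, the stop analyzer accepts a custom stop word list. A sketch, with an illustrative index name and word list:
PUT stop_custom_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "a", "an", "is", "on"]
        }
      }
    }
  }
}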
7. Language analyzer
Language analyzers are tailored to specific languages (e.g., English, Spanish, French). They incorporate language-specific tokenization and stemming rules for more accurate, context-aware indexing.
Example: rebuilding the Bengali analyzer as a custom analyzer
PUT /bengali_example
{
  "settings": {
    "analysis": {
      "filter": {
        "bengali_stop": {
          "type": "stop",
          "stopwords": "_bengali_"
        },
        "bengali_keywords": {
          "type": "keyword_marker",
          "keywords": ["উদাহরণ"]
        },
        "bengali_stemmer": {
          "type": "stemmer",
          "language": "bengali"
        }
      },
      "analyzer": {
        "rebuilt_bengali": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "decimal_digit",
            "bengali_keywords",
            "indic_normalization",
            "bengali_normalization",
            "bengali_stop",
            "bengali_stemmer"
          ]
        }
      }
    }
  }
}
With this analyzer in place, you can analyze Bengali text with Bengali stop words, normalization, and stemming applied.
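You can verify the analyzer directly against the new index with the _analyze API, reusing the keyword-marked word from the settings above:
POST bengali_example/_analyze
{
  "analyzer": "rebuilt_bengali",
  "text": "উদাহরণ"
}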
Use Case: Multilingual Content
Scenario: Your dataset includes documents in different languages. By using
language-specific analyzers (e.g., English, Spanish, French), you can account for
language-specific tokenization and stemming, improving the accuracy of search
results in diverse linguistic contexts.
Conclusion
Elasticsearch provides a rich set of analyzers catering to various use cases. Whether
dealing with multilingual content, structured data, or specific tokenization needs,
selecting the right analyzer is key to achieving efficient and accurate search results.
By understanding the nuances of analyzers like Keyword, Language, Pattern, Simple,
Standard, Stop, and Whitespace, you can fine-tune your Elasticsearch setup for
optimal performance and relevance in diverse scenarios. Partnering with experts in
ElasticSearch Consulting and Development Services can further amplify your
Elasticsearch capabilities for tailored and effective solutions.
Originally published by: Elasticsearch Analyzers: Field-Level Optimization
