In-depth understanding of Elasticsearch topic: Text Analysis

  • Overview

    The text below comes from the Elasticsearch Reference 7.0 documentation, collected here for study.

    Text Analysis is the process of converting unstructured text, like the body of an email or a product description, into a structured format that’s optimized for search.

  • Tokenization

    Analysis makes full-text search possible through tokenization: breaking a text down into small chunks, called tokens.

    In most cases, these tokens are individual words.
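
    For example, the _analyze API shows how a tokenizer breaks text into tokens. A minimal sketch in Python, assuming a local node at http://localhost:9200 and the requests library:

      import requests

      # Ask Elasticsearch to tokenize a sentence with the standard tokenizer.
      resp = requests.post(
          "http://localhost:9200/_analyze",
          json={"tokenizer": "standard", "text": "The quick brown fox"},
      )
      print([t["token"] for t in resp.json()["tokens"]])
      # ['The', 'quick', 'brown', 'fox']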

  • Normalization

    Tokenization enables matching on individual terms, but each token is still matched literally.

    Normalization converts tokens into a standard format, so a search can also match synonyms, words with similar meanings, and words that share the same root.
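
    For example, the lowercase token filter is one such normalization: it lets a query for "quick" match "QUICK". A sketch under the same local-node assumption:

      import requests

      # Lowercasing normalizes case differences between tokens.
      resp = requests.post(
          "http://localhost:9200/_analyze",
          json={"tokenizer": "standard", "filter": ["lowercase"], "text": "QUICK Foxes"},
      )
      print([t["token"] for t in resp.json()["tokens"]])
      # ['quick', 'foxes']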

  • Analyzer

    Text analysis is performed by an analyzer, a set of rules that govern the entire process.

    A custom analyzer gives you control over each step of the analysis process, including:

    • Changes to the text before tokenization
    • How text is converted to tokens
    • Normalization changes made to tokens before indexing or search

    An analyzer - whether built-in or custom - is just a package which contains three lower-level building blocks: character filters, tokenizers, and token filters.
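
    For example, a custom analyzer combining all three building blocks can be declared in the index settings when the index is created. A sketch, where the index name my-index and analyzer name my_analyzer are placeholders:

      import requests

      # Create an index whose custom analyzer chains the three building blocks.
      settings = {
          "settings": {
              "analysis": {
                  "analyzer": {
                      "my_analyzer": {
                          "type": "custom",
                          "char_filter": ["html_strip"],    # before tokenization
                          "tokenizer": "standard",          # text -> tokens
                          "filter": ["lowercase", "stop"],  # normalize tokens
                      }
                  }
              }
          }
      }
      requests.put("http://localhost:9200/my-index", json=settings)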

  • Character Filters

    A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters.

    An analyzer may have zero or more character filters.
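
    For example, the built-in html_strip character filter removes HTML markup from the character stream before the tokenizer runs. A sketch:

      import requests

      # Strip HTML tags from the raw characters, then tokenize what remains.
      resp = requests.post(
          "http://localhost:9200/_analyze",
          json={
              "char_filter": ["html_strip"],
              "tokenizer": "standard",
              "text": "<p>so <b>happy</b></p>",
          },
      )
      print([t["token"] for t in resp.json()["tokens"]])
      # ['so', 'happy']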

  • Tokenizer

    A tokenizer receives a stream of characters, breaks it up into individual tokens, and outputs a stream of tokens.

    The tokenizer is also responsible for recording the order or position of each term, as well as the start and end character offsets of the original word the term represents.

    An analyzer must have exactly one tokenizer.
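
    The position and offset information is visible in the _analyze output. A sketch:

      import requests

      # Each token records its position and its character offsets in the original text.
      resp = requests.post(
          "http://localhost:9200/_analyze",
          json={"tokenizer": "standard", "text": "brown fox"},
      )
      for t in resp.json()["tokens"]:
          print(t["token"], t["position"], t["start_offset"], t["end_offset"])
      # brown 0 0 5
      # fox 1 6 9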

  • Token filters

    A token filter receives the token stream and may add, remove, or change tokens.

    An analyzer may have zero or more token filters, which are applied in order.
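
    For example, chaining the lowercase and stop token filters; the order matters here, because the default stop filter only matches lowercase stop words such as "the". A sketch:

      import requests

      # Filters run in the listed order: lowercase first, then stop-word removal.
      resp = requests.post(
          "http://localhost:9200/_analyze",
          json={
              "tokenizer": "standard",
              "filter": ["lowercase", "stop"],
              "text": "The QUICK brown fox",
          },
      )
      print([t["token"] for t in resp.json()["tokens"]])
      # ['quick', 'brown', 'fox']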

  • Index and search analysis

    Text analysis occurs at two times:

    When a document is indexed, any text field values are analyzed; this is index time. When a full-text search runs on a text field, the query string is analyzed; this is called search time or query time.

    The analyzer, or set of analysis rules, used at each time is called the index analyzer or search analyzer respectively.
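
    For example, the analyzer and search_analyzer mapping parameters name the analyzer used at each time. A sketch with a placeholder index and field:

      import requests

      # "analyzer" applies at index time; "search_analyzer" applies at query time.
      mapping = {
          "mappings": {
              "properties": {
                  "title": {
                      "type": "text",
                      "analyzer": "standard",
                      "search_analyzer": "simple",
                  }
              }
          }
      }
      requests.put("http://localhost:9200/my-index", json=mapping)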

  • Stemming

    Stemming is the process of reducing a word to its root form.

    For example, walking and walked can be stemmed to the same root word: walk.

    In some cases, the root form of a stemmed word may not be a real word.

    Stemming is handled by stemmer token filters. These token filters can be categorized based on how they stem words: Algorithmic stemmers and Dictionary stemmers.
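
    For example, the stemmer token filter (an algorithmic stemmer, English by default) reduces both forms above to walk. A sketch:

      import requests

      # Algorithmic stemming: "walking" and "walked" both reduce to "walk".
      resp = requests.post(
          "http://localhost:9200/_analyze",
          json={
              "tokenizer": "standard",
              "filter": ["stemmer"],
              "text": "walking walked",
          },
      )
      print([t["token"] for t in resp.json()["tokens"]])
      # ['walk', 'walk']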

Origin: blog.csdn.net/The_Time_Runner/article/details/111709150