Inverted index and tokenizer

I. Inverted index

1. What is an inverted index


An inverted index mainly consists of two parts:

1. Term Dictionary

Records all the terms that appear in the documents, together with the mapping from each term to its posting list. In a production environment the number of terms is usually large, so the dictionary is typically implemented with a B+ tree or a hash table with chaining to support high-performance insertion and lookup.

2. Posting List

Records the inverted-index entries (postings) for the documents that contain a given term. Each posting mainly includes the following attributes (the _termvectors sketch after this list shows them for a concrete document):

  • Document ID
  • Term frequency (the number of times the term appears in the document, used for relevance scoring)
  • Position (the position of the term in the document's token stream, used for phrase search)
  • Offset (the start and end character offsets of the term, used for highlighting)
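
These attributes can be inspected with the _termvectors API. The request below is only an illustrative sketch: the index name my_index, document id 1, and the field title are made-up names.

GET my_index/_termvectors/1
{
    "fields":["title"],
    "positions":true,
    "offsets":true,
    "term_statistics":true
}

The response lists, for each term of the title field, its term frequency, positions, and start/end character offsets.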

By default, every field of an ES JSON document gets its own inverted index. You can also specify that certain fields are not indexed. The advantage of this is that it saves space, but the disadvantage is just as obvious: fields that are not indexed cannot be searched.
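
For example, a mapping can disable indexing for a field that only needs to be stored and returned, not searched. The index and field names below (log_index, message, trace_data) are illustrative:

PUT log_index
{
    "mappings":{
        "properties":{
            "message":{ "type":"text" },
            "trace_data":{ "type":"keyword", "index":false }
        }
    }
}

Queries against trace_data will be rejected because the field is not indexed, while message remains fully searchable.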

II. Tokenizers

1. Analysis and analyzer

In ES, full text is converted into individual terms by a specific analyzer; this whole process is called analysis.

The analyzer is the component that performs this word segmentation. An analyzer is mainly composed of three parts:

  • Character Filters: preprocess the raw text, for example stripping HTML tags
  • Tokenizer: splits the text into individual tokens according to specific rules
  • Token Filters: process the resulting tokens, for example lowercasing, removing stop words, or adding synonyms

There are two main categories of analyzers in ES: built-in analyzers and custom analyzers (a custom analyzer definition is sketched below). In a production environment you need to choose the right analyzer based on the actual data and business requirements; both writing and reading documents go through the corresponding analyzer.
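
As a sketch of how the three parts fit together, the following settings define a custom analyzer. The index name blog_index and analyzer name my_analyzer are made up; the character filter, tokenizer, and token filters used are all built-in components:

PUT blog_index
{
    "settings":{
        "analysis":{
            "analyzer":{
                "my_analyzer":{
                    "type":"custom",
                    "char_filter":["html_strip"],
                    "tokenizer":"standard",
                    "filter":["lowercase","stop"]
                }
            }
        }
    }
}

Here html_strip removes HTML tags (character filter), standard splits the text into tokens (tokenizer), and lowercase and stop lowercase the tokens and remove stop words (token filters).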

2. Common built-in analyzers

  • Standard Analyzer: the default analyzer; splits text on word boundaries, lowercases, stop-word filtering off by default
  • Simple Analyzer: splits on non-letter characters (the non-letter characters are dropped), lowercases
  • Stop Analyzer: like the Simple Analyzer, plus stop-word filtering (is, a, the, ...)
  • Whitespace Analyzer: splits on whitespace, no lowercasing
  • Keyword Analyzer: no tokenization; the input is emitted unchanged as a single token
  • Pattern Analyzer: splits by regular expression, default \W+ (non-word characters as delimiters)
  • Language analyzers: built-in analyzers for 30+ common languages
  • Custom Analyzer: a user-defined analyzer
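
For instance, the same text produces different tokens under different analyzers (the sample text is arbitrary):

GET /_analyze
{
    "analyzer":"whitespace",
    "text":"The QUICK brown-foxes"
}

The whitespace analyzer should return [The, QUICK, brown-foxes] unchanged, whereas the standard analyzer would return the lowercased, word-split tokens [the, quick, brown, foxes].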

3. Chinese word segmenters

1) ICU Analyzer

The corresponding plugin has to be installed manually. It provides Unicode-based tokenization and better support for Asian languages.

elasticsearch-plugin install analysis-icu
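
Once the plugin is installed, its analyzer can be tried out through _analyze. This assumes the analyzer name icu_analyzer registered by the analysis-icu plugin; the Chinese sample sentence is arbitrary:

GET /_analyze
{
    "analyzer":"icu_analyzer",
    "text":"他说的确实在理"
}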

2) IK Analyzer

Supports custom dictionaries and hot updates of the segmentation dictionary.
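
Assuming this refers to the IK plugin (elasticsearch-analysis-ik), it registers two analyzers, ik_smart (coarse-grained) and ik_max_word (fine-grained), which can be tested the same way once the plugin is installed:

GET /_analyze
{
    "analyzer":"ik_max_word",
    "text":"中华人民共和国国歌"
}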

3) THULAC

A Chinese word-segmentation toolkit provided by the Natural Language Processing and Social Humanities Computing Lab at Tsinghua University.

4. _analyze API

Three ways to check how an analyzer segments text:

1) Use _analyze directly

GET /_analyze
{
    "analyzer":"standard",
    "text":"This is a Test"
}

2) Specify a field of an index to see how that field is analyzed

POST ${index_name}/_analyze
{
    "filed":"title",
    "text":"Mastering Elasticsearch"
}

3) Specify a custom combination of tokenizer and token filters to see the resulting segmentation

POST /_analyze
{
    "tokenizer":"standard",
    "filter":["lowercase"],
    "text":"Mastering Elasticsearch"
}
