Elasticsearch inverted index

1. What is an inverted index?

The inverted index is a very important index structure in Elasticsearch. It maps each word to the IDs of the documents that contain it.

1.1 A simple understanding through an example

Take column articles as an example. When we search by keyword on major platforms, "inverted index" technology is usually what is used.

data structure

Suppose our articles are stored as shown above. In a relational database such as MySQL, the common structure is "id->topic->content". If we know the id or topic when we search, retrieval is very efficient, because it is easy to create indexes on "id" and "topic".

Forward index

But suppose we only have a search keyword, for example when the requirement is to find articles related to "inverted index". With the "id->topic->content" structure, we can only scan the full text of "topic" and "content", and once the data grows by orders of magnitude the efficiency is unacceptable. The indexes of a relational database handle this type of search poorly. It is a good fit for the inverted index used in full-text search.

So what is the structure of an inverted index? To put it simply, it is indexed by the keywords of the content, and the mapping is "keywords of content->ID". In this case we only need to look up the keyword, which is much faster than scanning every document.

Inverted index
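The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration: the two sample articles and the function names `scan_search` and `index_search` are made up for this sketch and are not part of Elasticsearch.

```python
# Hypothetical sample articles laid out as "id->topic->content".
articles = [
    {"id": 1, "topic": "ES basics", "content": "learning the es inverted index"},
    {"id": 2, "topic": "MySQL basics", "content": "learning the mysql index"},
]

def scan_search(keyword):
    """Forward layout: every article's topic and content must be scanned."""
    return [
        a["id"]
        for a in articles
        if keyword in a["topic"].lower().split()
        or keyword in a["content"].lower().split()
    ]

# Inverted layout: keyword of content -> IDs of the articles containing it.
inverted = {}
for a in articles:
    for word in a["content"].lower().split():
        inverted.setdefault(word, set()).add(a["id"])

def index_search(keyword):
    """Inverted layout: a single dictionary lookup, no scan over articles."""
    return sorted(inverted.get(keyword, set()))

print(scan_search("es"))      # [1]
print(index_search("index"))  # [1, 2]
```

The scan touches every article on every query, while the lookup cost does not depend on how many articles do not contain the keyword.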

1.2 Core composition

  • The inverted index consists of two parts:
    • Word dictionary: records all words that occur in the documents, and the association between each word and its inverted list
    • Inverted list: records the documents that each word occurs in, and is composed of inverted index entries
  • Inverted index entry:
    • Document ID
    • Term frequency (TF): the number of times the word appears in the document, used for relevance scoring
    • Position: the position of the token in the document after word segmentation, used for phrase queries
    • Offset: the start and end positions of the word, used for highlighting

To give a simple example of an "inverted index entry", take the token "learning":

Inverted index entry list
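As a rough sketch of what such entries contain, the following Python computes the document ID, term frequency, positions, and offsets for the token "learning". The `Posting` class, the `postings_for` function, and the two sample documents are assumptions made for illustration. They do not reflect how Lucene actually stores postings.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Posting:
    """One inverted index entry (illustrative structure only)."""
    doc_id: int
    term_freq: int = 0
    positions: list = field(default_factory=list)  # token positions in the doc
    offsets: list = field(default_factory=list)    # (start, end) character offsets

def postings_for(token, docs):
    """Collect one entry per document that contains the token."""
    result = []
    for doc_id, text in docs.items():
        entry = Posting(doc_id)
        for position, match in enumerate(re.finditer(r"\w+", text)):
            if match.group().lower() == token:
                entry.term_freq += 1
                entry.positions.append(position)
                entry.offsets.append((match.start(), match.end()))
        if entry.term_freq:
            result.append(entry)
    return result

# Hypothetical documents, not the ones from the original figure.
docs = {1: "Learning es, learning index", 2: "Keep learning"}
for entry in postings_for("learning", docs):
    print(entry)
```

For document 1 this yields a term frequency of 2, positions 0 and 2, and offsets (0, 8) and (13, 21). The offsets are what a highlighter would use to wrap the matched text.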

2. How does the inverted index work?

It mainly involves two processes: 1. creating the inverted index; 2. searching the inverted index

2.1 Create an inverted index

We continue with the example above. First, the content of each document is segmented into tokens, that is, words. Then the correspondence between each token and the documents that contain it is saved. The result is as follows:
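A minimal Python sketch of this build step is shown here. It assumes two hypothetical documents chosen to be consistent with the search examples in this article, and a very simple `tokenize` function (lowercase, split on whitespace) that stands in for a real analyzer.

```python
from collections import defaultdict

def tokenize(text):
    """Stand-in for word segmentation: lowercase and split on whitespace."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map every token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

# Hypothetical documents; the original figure is not available.
docs = {1: "Learning es index", 2: "Learning mysql index"}
index = build_inverted_index(docs)
for token in sorted(index):
    print(token, sorted(index[token]))
```

With these documents, "learning" and "index" map to documents 1 and 2, "es" maps only to document 1, and "mysql" maps only to document 2.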

2.2 Inverted index search

Search example 1: "Learning Index"
  • Word segmentation first, get two tokens: "learning", "index"
  • Then go to the inverted index to match

These 2 tokens match in both documents, so both documents will be returned with the same score.

Search example 2: "Learning es"

Similarly, both documents match, so both are returned. However, the relevance score of document 1 will be higher than that of document 2, because document 1 matches both tokens, while document 2 matches only one token, "learning".
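The two search examples can be reproduced with a toy scorer that simply counts how many query tokens each document matches. This is a deliberate simplification for illustration: Elasticsearch's real relevance scoring (BM25 by default) also weighs term frequency and how rare a term is. The `index` contents and the `search` function below are hypothetical.

```python
from collections import Counter

# Hypothetical inverted index consistent with the two search examples.
index = {
    "learning": {1, 2},
    "index": {1, 2},
    "es": {1},
    "mysql": {2},
}

def search(query):
    """Return (doc_id, score) pairs, where score = number of matched tokens."""
    scores = Counter()
    for token in query.lower().split():      # word segmentation of the query
        for doc_id in index.get(token, ()):  # match against the inverted index
            scores[doc_id] += 1
    # Highest score first; ties are broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

print(search("Learning Index"))  # [(1, 2), (2, 2)]
print(search("Learning es"))     # [(1, 2), (2, 1)]
```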

3. Analysis (word segmentation)

Analysis, that is, text analysis, is the process of converting full text into a series of words (terms / tokens). It is also called word segmentation. In Elasticsearch, analysis is performed by an analyzer: you can use the built-in analyzers, or define custom analyzers as needed.

3.1 Analyzer consists of three parts

• Character Filters: process the original text, such as removing HTML
• Tokenizer: divide the text into words according to rules
• Token Filters: process the resulting words, for example lowercase them, delete stopwords, and add synonyms

3.2 Introduction to Analyzer word segmentation process

  • 1) character filter

First, the string passes through each character filter in order. Their task is to tidy up the string before word segmentation. A character filter can be used to remove HTML, or to convert & to and.

  • 2) tokenizer

Second, the string is divided into individual terms by the tokenizer. A simple tokenizer might split the text into terms whenever it encounters whitespace or punctuation.

  • 3) token filter

Finally, the terms pass through each token filter in order. This process may change terms (for example, the lowercase token filter turns ES into es), delete terms (for example, the stop token filter removes stopwords such as a, and, the), or add terms (for example, the synonym token filter adds synonyms such as jump and leap).
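The three stages can be imitated in plain Python to make the order of operations concrete. This is only a sketch of the idea, not Elasticsearch's implementation: the function names, the stopword set, and the synonym table are all assumptions made for this example.

```python
import html
import re

STOPWORDS = {"a", "and", "the"}   # assumed stopword list
SYNONYMS = {"jump": ["leap"]}     # assumed synonym table

def char_filter(text):
    """1) Character filter: strip HTML tags and convert & to 'and'."""
    text = re.sub(r"<[^>]+>", "", text)  # remove HTML tags
    text = html.unescape(text)           # turn entities like &amp; into &
    return text.replace("&", " and ")

def tokenizer(text):
    """2) Tokenizer: split into terms on whitespace and punctuation."""
    return re.findall(r"\w+", text)

def token_filters(tokens):
    """3) Token filters: lowercase, drop stopwords, add synonyms."""
    out = []
    for token in tokens:
        token = token.lower()
        if token in STOPWORDS:
            continue
        out.append(token)
        out.extend(SYNONYMS.get(token, []))
    return out

def analyze(text):
    """Run the three stages in order, as an analyzer does."""
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<p>The ES &amp; Lucene jump</p>"))
# ['es', 'lucene', 'jump', 'leap']
```

Note that the "and" produced by the character filter is later removed by the stopword filter, which shows why the order of the stages matters.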

Origin blog.csdn.net/Erica_1230/article/details/114978518