A brief introduction to the inverted index


1. Concept

An inverted index, also often called a reverse index, postings file, or inverted file, is an indexing method used in full-text search to store a mapping from each word to its locations in a document or a group of documents. It is the most commonly used data structure in document retrieval systems.

With an inverted index, you can quickly obtain the list of documents that contain a given word by looking up the word itself. An inverted index consists mainly of two parts: the "word dictionary" and the "inverted file".
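As a rough illustration of these two parts, the sketch below keeps a word dictionary (word to word ID) and an inverted file (word ID to posting list) as two plain Python dicts. The toy corpus and the names `word_dictionary`, `inverted_file`, and `lookup` are assumptions made for this example; a real engine stores both structures in far more compact forms.

```python
# Hypothetical toy corpus: document ID -> text.
docs = {
    1: "apple banana",
    2: "banana cherry",
    3: "apple cherry banana",
}

word_dictionary = {}  # word -> word ID
inverted_file = {}    # word ID -> sorted list of document IDs (posting list)

for doc_id in sorted(docs):
    for word in docs[doc_id].split():
        # Assign the next free ID the first time a word is seen.
        word_id = word_dictionary.setdefault(word, len(word_dictionary) + 1)
        postings = inverted_file.setdefault(word_id, [])
        # Documents are visited in ascending order, so checking the last
        # entry is enough to avoid duplicate document IDs.
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)


def lookup(word):
    """Return the IDs of documents containing `word` (empty list if none)."""
    word_id = word_dictionary.get(word)
    if word_id is None:
        return []
    return inverted_file.get(word_id, [])
```

Here `lookup("banana")` goes through the dictionary once and then reads the posting list directly, without scanning any document text.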


Extension:

There are two different forms of inverted index:

  • A record-level inverted index (or inverted file index) contains, for each word, a list of the documents that reference it.
  • A word-level inverted index (or full inverted index) additionally contains the positions of each word within a document.
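The difference between the two forms can be sketched as follows. This is a minimal example over a made-up two-document corpus; the names `record_level` and `word_level` are illustrative only, and positions are counted as zero-based word offsets.

```python
from collections import defaultdict

# Hypothetical corpus used only for this illustration.
docs = {
    1: "to be or not to be",
    2: "to do is to be",
}

# Record-level: word -> set of document IDs.
record_level = defaultdict(set)
# Word-level: word -> {document ID -> list of word positions}.
word_level = defaultdict(lambda: defaultdict(list))

for doc_id, text in docs.items():
    for position, word in enumerate(text.split()):
        record_level[word].add(doc_id)
        word_level[word][doc_id].append(position)
```

The word-level form takes more space, but the stored positions make phrase and proximity queries possible, which the record-level form cannot answer on its own.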



2. Process

The search engine (Elasticsearch, ES) uses a tokenizer (for example, the IK tokenizer) to split each stored document into words, and then builds an inverted index table from those words. When a search query is entered, the search engine tokenizes the query, looks up the inverted index table to calculate a relevance score for each document, and finally returns the documents with the highest relevance scores.

Core steps: word segmentation, building the inverted index table, and calculating the relevance score.
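The three core steps can be sketched roughly as below. This is not how Elasticsearch is implemented: the whitespace `tokenize` function is only a stand-in for a real tokenizer such as IK, and the score is simply the number of distinct query terms a document contains, a deliberate simplification of real relevance scoring. All function names are hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    # Stand-in for a real tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_index(docs):
    # Step 2: map every term to the set of documents that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    # Step 3: score each document by how many distinct query terms it contains.
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical corpus for demonstration.
docs = {1: "red apple", 2: "green apple", 3: "green pear"}
index = build_index(docs)
results = search(index, "green apple")
```

With this toy corpus, document 2 ranks first because it is the only one containing both query terms; only documents that appear in at least one posting list are ever scored.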



3. Examples

Documents:

  • Doc1: PHP is the best programming language in the world.
  • Doc2: Java is the best programming language in the world.
  • Doc3: C is the best programming language in the world.
  • Doc4: C++ is the best programming language in the world.
  • Doc5: Python is the best programming language in the world.

Inverted index:

Word ID   Word                   Inverted list
1         PHP                    1
2         World                  1,2,3,4,5
3         Programming language   1,2,3,4,5
4         Java                   2
5         C                      3
6         C++                    4
7         Python                 5

When the query "Java programming language" is entered, the tokenizer splits it into two words, "Java" and "programming language". The search engine then looks these words up in the inverted index table and calculates a relevance score for each document. The table shows that Doc2 is the only document that matches both "Java" and "programming language", so it has the highest score and Doc2 is returned: Java is the best programming language in the world.
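The lookup in this example can be reproduced with a few lines of code. The sketch below copies the inverted lists from the table, writes the terms as they appear in the documents, takes the tokenizer output as given rather than computing it, and again assumes a simple score of one point per matching query word, which is a simplification of real relevance scoring.

```python
# Inverted lists copied from the example table (word -> document IDs).
inverted_index = {
    "PHP": [1],
    "world": [1, 2, 3, 4, 5],
    "programming language": [1, 2, 3, 4, 5],
    "Java": [2],
    "C": [3],
    "C++": [4],
    "Python": [5],
}

# Assumed tokenizer output for the query "Java programming language".
query_terms = ["Java", "programming language"]

# One point for every query word whose inverted list contains the document.
scores = {}
for term in query_terms:
    for doc_id in inverted_index.get(term, []):
        scores[doc_id] = scores.get(doc_id, 0) + 1

best_doc = max(scores, key=scores.get)
```

Doc2 ends up with a score of 2 while every other document scores 1, so `best_doc` is 2, matching the result described above.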

Reposted from: blog.csdn.net/weixin_51123079/article/details/128145071