Data structures and algorithms: the inverted table (inverted index)

The core technology of search engines is inverted indexing. Covering it will probably take several articles. We will first talk about the technical principles of the inverted index, then about which data structures and algorithms can be used to implement one, and finally about how an indexer generates an inverted index from documents.

What is an inverted index? We all know that an index is a data structure for finding documents faster. For example, if every document is given a number, a document can be found quickly through that number. An inverted index does not start from the document number. It is an index structure that finds documents through the words they contain.

Inverted indexing is simple and efficient, and it is tailor-made for things like search engines. It is this technique that makes a search engine possible, so that a single keyword is enough to find the content we want in a large number of articles.

Let's look at an example. Suppose we have the following documents:

Table 1: documents

Document number    Document content
1                  This is an engine implemented in Go language
2                  PHP is the best language in the world
3                  Linux is implemented in C language and assembly language
4                  Google is one of the best search companies in the world

Intuitively, we can find documents quickly by their numbers 1, 2, 3, and 4. But we need to find documents by keyword, so we change the table above slightly, and the result is an inverted index.

Table 2: inverted table (inverted index), only some keywords are listed

Keyword        Document numbers
Go             1
language       1, 2, 3
implemented    1, 3
search         4
engine         1
PHP            2
world          2, 4
best           2, 4
assembly       3
company        4

This is very easy to understand. An inverted index is simply a new table generated after segmenting the content of the documents into words. Through this table, we can quickly find the documents corresponding to each keyword. That is the core principle of the inverted index, and also the most basic cornerstone of a search engine. Whether it is Google or Baidu, the core is these two tables; without them, nothing can be done.
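As a rough illustration, the sketch below builds such a table in memory from the four example documents. The `tokenize` function is a hypothetical stand-in for a real word segmenter: it only lowercases and splits on whitespace, with no stop-word removal or stemming, so it produces "companies" rather than "company" and also indexes words like "is" and "the".

```python
from collections import defaultdict

# The four example documents (document number -> content).
documents = {
    1: "This is an engine implemented in Go language",
    2: "PHP is the best language in the world",
    3: "Linux is implemented in C language and assembly language",
    4: "Google is one of the best search companies in the world",
}

def tokenize(text):
    """Hypothetical segmenter: lowercase the text and split on whitespace."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each word to the sorted list of document numbers containing it."""
    index = defaultdict(set)
    for doc_id, content in docs.items():
        for word in tokenize(content):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

inverted_index = build_inverted_index(documents)
print(inverted_index["language"])     # [1, 2, 3]
print(inverted_index["implemented"])  # [1, 3]
print(inverted_index["world"])        # [2, 4]
```

A set is used while building so that a word appearing twice in one document (such as "language" in document 3) is recorded only once.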

It looks very simple. Let's simulate one search on this search engine. For example, we type in the keywords: search engine

  1. First, segment the query "search engine" into words: search / engine;
  2. In Table 2, we find that the word "search" is in row 4 and the word "engine" is in row 5;
  3. Read column 2 of row 4 and column 2 of row 5 to get the document numbers, which are 4 and 1;
  4. Go to the first table and look up the actual content of each document by its document number;
  5. Display documents 1 and 4 as the results;
  6. Search complete.
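The steps above can be sketched in a few lines. This is only a minimal illustration: the two dictionaries are hand-written copies of the two tables (the inverted one limited to a few keywords), the `search` function name is an assumption, and the query is segmented by splitting on whitespace.

```python
# Table 1: document number -> content.
documents = {
    1: "This is an engine implemented in Go language",
    2: "PHP is the best language in the world",
    3: "Linux is implemented in C language and assembly language",
    4: "Google is one of the best search companies in the world",
}

# Table 2: keyword -> document numbers (only a few keywords).
inverted_index = {
    "search": [4],
    "engine": [1],
    "language": [1, 2, 3],
    "world": [2, 4],
}

def search(query):
    """Segment the query, look up each word, merge the numbers, fetch contents."""
    terms = query.lower().split()                      # step 1: segment
    doc_ids = set()
    for term in terms:                                 # steps 2-3: look up document numbers
        doc_ids.update(inverted_index.get(term, []))
    # step 4: fetch the content of each document from the first table
    return [(doc_id, documents[doc_id]) for doc_id in sorted(doc_ids)]

results = search("search engine")
for doc_id, content in results:                        # step 5: display
    print(doc_id, content)
```

As in the walk-through, the document numbers of the two words are merged as a union, so a document matching either word is returned.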

The above is the most basic technique behind search engines. How to design data structures and algorithms that implement Table 2 is the key to search engine technology.
Before implementing the data structures and algorithms, we need to keep in mind that search engines search massive amounts of data. Even a medium-sized e-commerce company typically has tens or hundreds of gigabytes of data, so this data structure should be stored on the local disk rather than in memory. Given that, to keep searches fast we can either implement our own cache for hot data or rely on an operating-system facility, mmap. Since a cache I implement myself is very unlikely to do better than the operating system's, I use mmap.
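To give a feel for the idea, here is a small sketch using Python's standard `mmap` module. The file layout is purely an assumption for illustration (the document numbers for one keyword stored as consecutive little-endian 32-bit integers), and the file name `language.postings` is hypothetical; it is not the format used by any particular engine.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical on-disk layout: the document numbers for one keyword,
# stored as consecutive little-endian unsigned 32-bit integers.
doc_ids = [1, 2, 3]
path = os.path.join(tempfile.mkdtemp(), "language.postings")
with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))

# Map the file read-only. The operating system loads pages on demand
# and keeps recently used pages in its own page cache.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        count = len(mm) // 4  # each document number takes 4 bytes
        loaded = list(struct.unpack_from(f"<{count}I", mm, 0))

print(loaded)  # [1, 2, 3]
```

The program reads the mapped file as if it were a byte buffer in memory, and the decision about which parts stay cached is left to the operating system.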




Origin blog.csdn.net/u013250861/article/details/113732888