Search Engine Indexing: Index Basics

This article is excerpted from Chapter 3 of "This is Search Engine: Detailed Explanation of Core Technologies"


This section introduces some basic concepts of search engine indexing through simple examples. Understanding these concepts is important for the later, more detailed discussion of how indexing works.

3.1.1 The word-document matrix

The word-document matrix is a conceptual model that expresses the containment relationship between words and documents; Figure 3-1 illustrates it. Each column in Figure 3-1 represents a document, each row represents a word, and a tick marks a cell where the document contains the word.

Figure 3-1 Word-document matrix

Read vertically, by document, each column shows which words a document contains. For example, document 1 contains word 1 and word 4 and no other words. Read horizontally, by word, each row shows which documents contain a given word. For example, word 1 appears in document 1 and document 4, and no other document contains it. The other rows and columns of the matrix are read the same way.
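As a minimal sketch, the matrix can be held as a list of rows of 0/1 ticks. Only the entries for document 1 and word 1 come from the text above; the 5 x 5 size, every other tick, and the names `matrix`, `words_in_document`, and `documents_with_word` are assumptions made for illustration.

```python
# Rows are words 1..5, columns are documents 1..5 (1 = "document contains word").
# Row 1 and column 1 follow the text; every other entry is made up for illustration.
matrix = [
    [1, 0, 0, 1, 0],  # word 1: documents 1 and 4
    [0, 1, 1, 0, 0],  # word 2 (assumed)
    [0, 0, 1, 0, 1],  # word 3 (assumed)
    [1, 1, 0, 0, 0],  # word 4 (assumed, except for document 1)
    [0, 0, 0, 1, 1],  # word 5 (assumed)
]


def words_in_document(doc_id):
    """Read one column: which words does this document contain?"""
    return [w + 1 for w, row in enumerate(matrix) if row[doc_id - 1]]


def documents_with_word(word_id):
    """Read one row: which documents contain this word?"""
    return [d + 1 for d, tick in enumerate(matrix[word_id - 1]) if tick]


print(words_in_document(1))    # [1, 4]
print(documents_with_word(1))  # [1, 4]
```

A dense matrix like this wastes space once most cells are empty, which is one reason real systems store only the ticks.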

The index of a search engine is in fact a concrete data structure that implements the word-document matrix. This conceptual model can be implemented in different ways, such as the inverted index, signature files, and suffix trees. However, experimental results show that the inverted index is the best way to implement the mapping from words to documents, so this chapter focuses on the technical details of the inverted index.

3.1.2 Basic concepts of inverted index

In this section, we will explain some special terms commonly used in inverted indexes. For the convenience of expression, these terms will be used directly in subsequent chapters of this book.

Document: Search engines generally process Internet web pages, but the concept of a document is broader: it denotes any storage object in text form. It covers more than web pages; files in formats such as Word, PDF, HTML, and XML can all be called documents, and so can an email, a short message, or a Weibo post. In the remainder of this book, "document" is often used to refer to textual information in general.

Document Collection: A collection composed of several documents is called a document collection. For example, a large number of Internet web pages or a large number of e-mails are specific examples of document collections.

Document ID: Inside the search engine, each document in the document collection is assigned a unique internal number, which serves as the document's unique identifier and simplifies internal processing. This internal number is called the "document number"; later sections sometimes write DocID for short.

Word ID: Similar to the document number, the search engine uses a unique number to represent a word, and the word number can be used as the unique representation of a word.

Inverted Index: The inverted index is a concrete storage form that implements the word-document matrix. Given a word, the inverted index quickly returns the list of documents that contain it. It consists of two main parts: the "word dictionary" (lexicon) and the "inverted file".

Lexicon: The usual indexing unit of a search engine is the word. The lexicon (word dictionary) is the set of strings formed by all words that appear in the document collection. Each entry in the lexicon records some information about the word itself and a pointer to the word's inverted list.

Posting List: The posting list (inverted list) of a word records all documents in which the word appears, together with the positions of the word in each document. Each record is called a posting. From the inverted list you can tell which documents contain a given word.

Inverted File: The inverted list of all words is often stored sequentially in a file on the disk. This file is called an inverted file, and an inverted file is a physical file that stores the inverted index.

The relationship between these concepts can be clearly seen through Figure 3-2.

Figure 3-2 Schematic diagram of the basic concept of inverted index
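The relationship between the lexicon, the posting lists, and the inverted file can be sketched in code. This is a minimal illustration under stated assumptions, not the layout of any particular engine: the words and DocIDs are hypothetical, a `bytearray` stands in for the file on disk, and each DocID is stored as a fixed 4-byte integer.

```python
import struct

# Hypothetical posting lists (word -> sorted DocIDs); not taken from the figures.
postings = {
    "index": [1, 3],
    "search": [1, 2, 3],
    "engine": [2],
}

inverted_file = bytearray()  # stands in for the physical file on disk
lexicon = {}                 # word -> (word ID, byte offset, number of postings)

for word_id, (word, doc_ids) in enumerate(sorted(postings.items()), start=1):
    lexicon[word] = (word_id, len(inverted_file), len(doc_ids))
    # Posting lists are appended one after another, each DocID as 4 bytes.
    inverted_file += struct.pack(f"<{len(doc_ids)}I", *doc_ids)


def read_posting_list(word):
    """Follow the lexicon entry's pointer into the inverted file."""
    if word not in lexicon:
        return []
    _, offset, count = lexicon[word]
    return list(struct.unpack_from(f"<{count}I", inverted_file, offset))


print(read_posting_list("search"))  # [1, 2, 3]
```

The lexicon stays small enough to search quickly, while the bulk of the data sits in the sequentially written inverted file and is read only for the words a query actually mentions.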

3.1.3 Simple example of inverted index

The logical structure and basic idea of an inverted index are very simple. The concrete example below is meant to give readers an overall, intuitive feel for it.

Assuming that the document collection contains five documents, the content of each document is shown in Figure 3-3, and the leftmost column in the figure is the document number corresponding to each document. Our task is to build an inverted index on this document collection.


Figure 3-3 Document Collection

Unlike English, Chinese has no explicit separators between words, so a word segmentation system is first used to split each document automatically into a sequence of words. Each document is thus converted into a data stream of words. To simplify later processing, each distinct word is assigned a unique word number, and the system records which documents contain that word. After this processing we obtain the simplest inverted index (see Figure 3-4). In Figure 3-4, the "Word ID" column records the word number of each word, the second column is the word itself, and the third column is the inverted list of that word. For example, the word "Google" has word number 1 and the inverted list {1, 2, 3, 4, 5}, which means that every document in the collection contains it.


Figure 3-4 Simple inverted index
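The construction just described can be sketched as follows. The documents here are hypothetical and already segmented into words; they are not the documents of Figure 3-3, so the word numbers will not match Figure 3-4. Assigning word numbers in order of first appearance is likewise an assumption.

```python
# Hypothetical, already-segmented documents keyed by DocID; these are NOT the
# documents of Figure 3-3.
documents = {
    1: ["google", "maps", "founder", "joins", "facebook"],
    2: ["google", "maps", "launches", "update"],
    3: ["facebook", "hires", "google", "founder"],
}

word_ids = {}        # word -> word ID, assigned in order of first appearance
inverted_index = {}  # word ID -> sorted list of DocIDs containing the word

for doc_id in sorted(documents):
    for word in documents[doc_id]:
        word_id = word_ids.setdefault(word, len(word_ids) + 1)
        doc_list = inverted_index.setdefault(word_id, [])
        # Documents are visited in DocID order, so checking the tail avoids duplicates.
        if not doc_list or doc_list[-1] != doc_id:
            doc_list.append(doc_id)

# Print one row per word: word ID, word, inverted list.
for word, word_id in word_ids.items():
    print(word_id, word, inverted_index[word_id])
```

In this toy collection "google" receives word number 1 and the list [1, 2, 3], because it is the first word seen and occurs in every document.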

The inverted index in Figure 3-4 is the simplest possible one because it only records which documents contain a given word; an index system can in fact record more than that. Figure 3-5 shows a somewhat richer inverted index. Compared with the basic index in Figure 3-4, the inverted list of each word records not only the document number but also the word frequency (TF), that is, the number of times the word occurs in that document. Word frequency is recorded because it is an important factor in computing the similarity between a query and a document when ranking search results, and storing it in the inverted list makes the later score calculation convenient. In the example of Figure 3-5, the word "founder" has word number 7 and its inverted list is (3:1), where 3 means that the document with document number 3 contains the word and 1 is the word frequency, meaning the word occurs exactly once in document 3. The inverted lists of the other words are read the same way.


Figure 3-5 Inverted index with word frequency information
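A minimal sketch of an inverted list that carries word frequency is shown below. The documents are hypothetical (not those of Figure 3-3) and are assumed to be segmented already; each posting is a (DocID, TF) pair, mirroring the (3:1) notation used above.

```python
from collections import Counter

# Hypothetical segmented documents (not those of Figure 3-3).
documents = {
    1: ["google", "maps", "google", "search"],
    2: ["facebook", "search"],
    3: ["google", "founder", "joins", "facebook"],
}

index_with_tf = {}  # word -> list of (DocID, TF) postings

for doc_id in sorted(documents):
    # Counter gives the number of occurrences of each word in this document.
    for word, tf in Counter(documents[doc_id]).items():
        index_with_tf.setdefault(word, []).append((doc_id, tf))

print(index_with_tf["google"])   # [(1, 2), (3, 1)]
print(index_with_tf["founder"])  # [(3, 1)]
```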

A practical inverted index can record still more information. In addition to the document number and word frequency, the index shown in Figure 3-6 records two further kinds of information: the "document frequency" of each word (the third column of Figure 3-6), and, in the inverted list, the positions at which the word occurs in each document.

Figure 3-6 Inverted index with word frequency, document frequency, and position information

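"Document frequency" indicates how many documents in the collection contain a given word. It is recorded for the same reason as word frequency: it is a very important factor in ranking search results. The position of a word within a document, by contrast, is not something an index system must record. A real index system may include it or leave it out, because it is not essential to the search system; position information is only useful when "phrase queries" are supported.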
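To see why positions matter for phrase queries, consider this minimal sketch of a two-word phrase lookup. The positional index, its contents, and the function name `phrase_query` are hypothetical; positions are counted from 1.

```python
# Hypothetical positional index: word -> {DocID: [positions, counted from 1]}.
positional_index = {
    "inverted": {1: [2, 7], 2: [5]},
    "index":    {1: [3], 2: [1, 9]},
}


def phrase_query(first, second):
    """Return DocIDs in which `second` occurs immediately after `first`."""
    hits = []
    first_docs = positional_index.get(first, {})
    second_docs = positional_index.get(second, {})
    # Only documents containing both words can contain the phrase.
    for doc_id in sorted(first_docs.keys() & second_docs.keys()):
        next_positions = set(second_docs[doc_id])
        if any(pos + 1 in next_positions for pos in first_docs[doc_id]):
            hits.append(doc_id)
    return hits


print(phrase_query("inverted", "index"))  # [1]
```

Both documents contain both words, but only document 1 has them adjacent and in order; without positions the index could not tell the two cases apart.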

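Take the word "拉斯" (Lars) as an example: its word number is 8 and its document frequency is 2, meaning that two documents in the whole collection contain it. The corresponding inverted list is {(3;1;<4>),(5;1;<4>)}, which means that the word appears in document 3 and document 5 with a word frequency of 1 in each, and that it occurs at position 4 in both documents, that is, the fourth word of each document is "拉斯".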
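An index carrying all three kinds of information can be sketched as below. The documents are hypothetical (not those of Figure 3-3) and assumed to be segmented already. Each posting is a (DocID, TF, positions) triple in the spirit of the (3;1;<4>) notation, and document frequency is derived as the length of the inverted list, which is one simple choice rather than the only one.

```python
# Hypothetical segmented documents (not those of Figure 3-3).
documents = {
    1: ["google", "maps", "founder", "leaves", "google"],
    2: ["facebook", "hires", "maps", "founder"],
}

# word -> list of (DocID, TF, [positions]) postings; positions count from 1.
full_index = {}

for doc_id in sorted(documents):
    positions = {}
    for position, word in enumerate(documents[doc_id], start=1):
        positions.setdefault(word, []).append(position)
    for word, where in positions.items():
        # TF is the number of positions recorded for the word in this document.
        full_index.setdefault(word, []).append((doc_id, len(where), where))

# Document frequency is simply the length of each word's inverted list.
document_frequency = {word: len(plist) for word, plist in full_index.items()}

print(document_frequency["founder"], full_index["founder"])
# 2 [(1, 1, [3]), (2, 1, [4])]
print(document_frequency["google"], full_index["google"])
# 1 [(1, 2, [1, 5])]
```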

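The inverted index shown in Figure 3-6 is already a very complete index system. The index structure of real search systems is essentially the same; the only difference lies in which concrete data structures are used to implement this logical structure.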

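With this index system, a search engine can respond to user queries easily. For example, when a user enters the query "Facebook", the search system looks the word up in the inverted index and reads out the documents that contain it; these documents are the search results returned to the user. Using the word frequency and document frequency information, the candidate results can then be ranked: the similarity between each document and the query is computed, and the results are output in descending order of similarity score. This is part of the internal flow of a search system; the concrete implementation is described in detail in Chapter 5 of this book.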
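The query flow above can be sketched as follows. The text does not specify a scoring formula at this point, so this sketch assumes a plain TF x IDF score with IDF = log(N / DF); the index contents, the collection size `N`, and the function name `search` are all hypothetical.

```python
import math

# Hypothetical index: word -> list of (DocID, TF) postings; N is the collection size.
N = 5
index_with_tf = {
    "facebook": [(1, 1), (3, 2), (5, 1)],
    "founder": [(3, 1)],
}


def search(query_words):
    """Score every document containing a query word by summed TF * IDF."""
    scores = {}
    for word in query_words:
        plist = index_with_tf.get(word, [])
        if not plist:
            continue
        idf = math.log(N / len(plist))  # len(plist) is the document frequency
        for doc_id, tf in plist:
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    # Highest score first; ties are broken by smaller DocID.
    return sorted(scores, key=lambda d: (-scores[d], d))


print(search(["facebook"]))             # [3, 1, 5]
print(search(["facebook", "founder"]))  # [3, 1, 5]
```

Document 3 ranks first because it has the highest word frequency for "facebook"; a rarer word such as "founder" carries a larger IDF weight and pushes its documents further up.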
