Understanding Forward and Inverted Indexes

1. Preface

In recommendation systems we often hear that an inverted index is built to make lookups fast. This always confused me: what is an inverted index, what is a forward index, and what is the point of building an inverted index? In this article I try to get a basic understanding of these two concepts by going through some reference material.

The following is my understanding based on the information I checked, which may not be correct.

The forward index and the inverted index are two important data structures in document retrieval systems. As I understand it, they are two different ways of organizing data for efficient queries, and what both of them store is a mapping between documents and words.

  • Forward index: a mapping from document to keywords. Given a document number, we get the document's content, keywords, and so on. The typical shape is {doc: [word1, word2, ...]}; in recommendation this may correspond to {item: [feature1, feature2, ...]}.
  • Inverted index: a mapping from keyword to documents. Given a keyword, we find the documents that contain it. The typical shape is {word: [doc1, doc2, ...]}; in recommendation this may correspond to {user_id: [item1, item2, ...]}.

Since these concepts come from document retrieval, let's start there.

In a search engine, each file has a file ID, and the content of the file is represented as a set of keywords.

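2. Forward index

For each document (page), build a mapping from the document to its keywords. The structure is as follows:

Forward index: points from a document to its keywords

document --> word 1, word 2

word 1: number of occurrences, positions of occurrence; word 2: positions of word 2; ...

Forward index: look up an entry by entering its ID in the search bar (the ID is known)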

Shown as a picture:
[Figure: forward index, each document pointing to the words it contains]
When the user enters a keyword in the search box and only a forward index is available, the engine has to scan every document in the index to find all the documents that contain the keyword. Those documents are then scored by the ranking model and presented to the user in ranked order.
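To make the cost of this scan concrete, here is a minimal sketch. The `forward_index` contents and the `search_by_scan` helper are made up for illustration; a real engine stores far more than a word list per document.

```python
# Hypothetical forward index: document ID -> list of words in that document.
forward_index = {
    1: ["google", "maps", "founder", "joins", "facebook"],
    2: ["google", "opens", "new", "office"],
    3: ["facebook", "launches", "maps", "feature"],
}


def search_by_scan(keyword, index):
    """Return IDs of documents containing `keyword` by checking every document."""
    hits = []
    for doc_id, words in index.items():  # one pass over the whole collection
        if keyword in words:
            hits.append(doc_id)
    return hits


print(search_by_scan("maps", forward_index))  # [1, 3]
```

Every query touches every document, so the work grows with the size of the collection rather than with the number of matching documents.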

The problem is that the number of documents indexed by Internet search engines is astronomical, and such an index structure simply cannot return ranked results in real time.

Therefore, the search engine rebuilds the forward index into an inverted index, that is, it converts the mapping from file ID to keywords into a mapping from keyword to file IDs. Each keyword corresponds to a list of files, and the keyword appears in every one of them.

If we compare this with recommendation, the document here is like the item we want to recommend, and the keyword is like the user_id: returning related documents for a keyword resembles a recommender system returning a list of related items for a given user_id. Analogies like this are useful.
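The conversion from a forward index to an inverted index can be sketched in a few lines. The sample data and the `invert` function name are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical forward index: document ID -> list of words in that document.
forward_index = {
    1: ["google", "maps", "founder", "joins", "facebook"],
    2: ["google", "opens", "new", "office"],
    3: ["facebook", "launches", "maps", "feature"],
}


def invert(forward):
    """Turn {doc_id: [words]} into {word: [doc_ids]}."""
    inverted = defaultdict(list)
    for doc_id, words in forward.items():
        for word in set(words):  # list each document at most once per word
            inverted[word].append(doc_id)
    return dict(inverted)


inverted_index = invert(forward_index)
print(inverted_index["maps"])  # [1, 3]
```

The inversion is done once, offline; after that, each query is a dictionary lookup instead of a scan.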

3. Inverted index

When a user searches for a keyword, the search engine shows the documents (pages) related to that keyword. This lookup is what an inverted index supports: the keyword points to the documents or files.

The structure is as follows:

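Inverted index: points from a keyword to documents

word 1 ---> document 1, document 2, document 3

word 2 ---> document 1, document 2

Inverted index: tokenize the text in the search box, look up which IDs contain each word, then fetch those IDs (look up IDs by token)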

Starting from a word, we find the documents. As a picture:
[Figure: inverted index, each word pointing to the documents that contain it]
Therefore, when a user searches for a keyword, the system locates the keyword in the inverted index and immediately finds the pages that contain it.
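The lookup described above can be sketched as follows. The `inverted_index` contents, the `search` helper, and the whitespace tokenization are all assumptions for illustration. Requiring every query word to match (set intersection) is one possible policy, not the only one.

```python
# Hypothetical inverted index: word -> IDs of documents containing the word.
inverted_index = {
    "google": [1, 2],
    "maps": [1, 3],
    "facebook": [1, 3],
}


def search(query, index):
    """Return sorted IDs of documents that contain every word in `query`."""
    words = query.lower().split()  # naive whitespace tokenization (assumption)
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))  # keep only documents matching all words
    return sorted(result)


print(search("maps", inverted_index))         # [1, 3]
print(search("google maps", inverted_index))  # [1]
```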

Inverted indexes have a wide range of application scenarios: search engines, large-scale database indexing, document retrieval, multimedia retrieval, etc.

Forward index: document --> words
Inverted index: word --> documents

4. Simple example

Let's use a small example to look at the forward and inverted indexes.

Suppose the document collection contains five documents, each with the following content (picture from the first link below):

[Figure: contents of the five example documents]

Then the forward index would be stored like this:

Document ID    Forward list
1      Google -> Maps -> father-of -> jumps-ship-to -> Facebook
2      Google -> Maps -> father-of -> joins -> Facebook
3      Google -> ...
4      ...
5      ...

And the inverted index:
[Figure: inverted index for the five example documents]

The indexing system can also record more information. For example, the inverted list of a word can store not only the document number but also the term frequency (TF), that is, the number of times the word appears in that document. This is recorded because term frequency is an important factor when computing the similarity between the query and a document while ranking search results.
[Figure: inverted index with term frequency]
A more complete inverted index can record even more. Besides the document number and term frequency, some indexes also record the document frequency of each word and, inside each inverted list entry, the positions at which the word appears in the document.

  • Document frequency: how many documents in the collection contain a given word. It is recorded because it is also an important factor in ranking search results.
  • Position information: where the word appears in a given document. This is optional and may or may not be recorded.
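As a rough sketch of such a richer index, the code below stores (document number, term frequency, positions) for each word and derives the document frequency from the length of each inverted list. The `build_postings` name, the toy documents, and the 1-based positions are assumptions for illustration.

```python
from collections import defaultdict


def build_postings(docs):
    """Map each word to a list of (doc_id, term_frequency, positions) tuples."""
    postings = defaultdict(list)
    for doc_id, words in docs.items():
        positions = defaultdict(list)
        for pos, word in enumerate(words, start=1):  # 1-based positions (assumption)
            positions[word].append(pos)
        for word, pos_list in positions.items():
            postings[word].append((doc_id, len(pos_list), pos_list))
    return dict(postings)


# Hypothetical tokenized documents.
docs = {
    1: ["a", "b", "a"],
    2: ["b", "c"],
}
postings = build_postings(docs)
# Document frequency: number of documents in each word's inverted list.
document_frequency = {word: len(plist) for word, plist in postings.items()}

print(postings["a"])            # [(1, 2, [1, 3])]
print(document_frequency["b"])  # 2
```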

Take the word "Las" as an example. Its word number is 8 and its document frequency is 2, meaning two documents in the collection contain it. The corresponding inverted list is {(3; 1; <4>), (5; 1; <4>)}, where each entry has the format (document number; number of occurrences in the document; <positions of occurrence>). As I understand it, once we have the frequency of a word in each document and the number of documents containing the word (together with the total number of documents), we can compute the TF-IDF value of the word in each document, which can serve as an important ranking signal.

[Figure: inverted index with document frequency, term frequency, and positions]
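Using the inverted list for "Las" and the five-document collection from the example, a TF-IDF value can be sketched as below. TF-IDF has many variants; this sketch assumes the plain form tf × ln(N / df), and the `tf_idf` function name is made up for illustration.

```python
import math


def tf_idf(postings, num_docs):
    """Return {doc_id: tf * ln(num_docs / df)} for one word's inverted list."""
    df = len(postings)              # document frequency of the word
    idf = math.log(num_docs / df)   # natural-log IDF (one common variant)
    return {doc_id: tf * idf for doc_id, tf, _positions in postings}


# Inverted list for "Las": (document number, term frequency, positions).
las_postings = [(3, 1, [4]), (5, 1, [4])]
scores = tf_idf(las_postings, num_docs=5)
print(scores)  # both documents get ln(5 / 2), roughly 0.916
```

Because the word appears once in each of the two documents, both receive the same score here; a document with a higher term frequency would score proportionally higher.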

5. Back to recommendation

5.1 Forward index of materials

In recommendation, index optimization is mostly used on the recall side, that is, when retrieving candidates.

In the resource pool, each material (a news article, a product, a song) has an ID, and the material is represented as a collection of fields such as title, category, geographic location, and price. The following figure shows part of the table structure of one such material pool.
[Figure: part of a material resource pool table]
With the unique ID of a material we can fetch its attribute fields and look up its details. This is the forward index in the usual sense.
[Figure: looking up a material's fields by its ID]
So what is the inverted index here? To understand it, we need to look at the actual application scenario. In the recall stage of a recommendation system, as mentioned above, we need to fetch all the materials under a given feature, topic, or keyword as the candidate set. The direction is now reversed: we start from a feature and look for the materials that have it. This is the inverted index.

[Figure: inverted index from a feature to the materials that have it]
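A minimal sketch of both directions for a material pool follows. The `materials` table, its field names, and the `build_feature_index` helper are hypothetical and only illustrate the shape of the data.

```python
from collections import defaultdict

# Hypothetical material pool (forward index): item ID -> attribute fields.
materials = {
    101: {"category": "news", "city": "Beijing"},
    102: {"category": "song", "city": "Shanghai"},
    103: {"category": "news", "city": "Shanghai"},
}


def build_feature_index(pool):
    """Build the inverted index {(field, value): [item_ids]} from the pool."""
    index = defaultdict(list)
    for item_id, fields in pool.items():
        for field, value in fields.items():
            index[(field, value)].append(item_id)
    return dict(index)


feature_index = build_feature_index(materials)

print(materials[101])                       # forward lookup: ID -> fields
print(feature_index[("category", "news")])  # recall by feature: [101, 103]
```

At recall time, the candidate set for a feature is read directly from `feature_index` instead of filtering the whole pool.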

5.2 The inverted-index idea in collaborative filtering

This is where I first came across the term inverted index: generating a recommendation list with user-based collaborative filtering.

In user-based collaborative filtering, given the current user, we find the top n users most similar to that user, look at which products those n users clicked, and recommend from those products.
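The recommendation step just described can be sketched as follows, assuming the user similarities are already known. The click data, the similarity values, and the `recommend` function are hypothetical; summing neighbour similarities per item is one simple scoring choice.

```python
from collections import defaultdict


def recommend(user, user_item, similarity, top_n=2):
    """Score items clicked by the top_n most similar users but not by `user`."""
    seen = user_item[user]
    neighbours = sorted(
        similarity[user].items(), key=lambda kv: kv[1], reverse=True
    )[:top_n]
    scores = defaultdict(float)
    for other, sim in neighbours:
        for item in user_item[other]:
            if item not in seen:
                scores[item] += sim  # weight each candidate by neighbour similarity
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical click data and precomputed similarities for user "A".
user_item = {
    "A": {"a", "b", "d"},
    "B": {"a", "c"},
    "C": {"b", "e"},
    "D": {"c", "d", "e"},
}
similarity = {"A": {"B": 0.41, "C": 0.41, "D": 0.33}}

print(recommend("A", user_item, similarity))
```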

Suppose that, from the user behavior log, we have obtained the products clicked by four users A, B, C, and D:
[Figure: products clicked by users A, B, C, and D]
The straightforward way to compute user similarity is a double loop over users: the outer loop traverses users and so does the inner loop. For each inner user we count the products clicked by both the inner and the outer user (or keep the vector of jointly clicked products), and once the pair has been processed we can compute a similarity. The pseudocode looks like this:

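import math
from collections import defaultdict

# user_item: {user: collection of items the user clicked}, built from the behavior log
similarity = defaultdict(dict)
for user1, item_list1 in user_item.items():
    for user2, item_list2 in user_item.items():
        if user1 == user2:
            continue
        con_click_action = 0
        for item in item_list1:
            if item in item_list2:  # clicked by both users
                con_click_action += 1

        # cosine similarity between the two users
        similarity[user1][user2] = con_click_action / math.sqrt(len(item_list1) * len(item_list2))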

This is very time-consuming when there are many users. In fact, most pairs of users have never acted on the same item, so most of the pairwise traversal is wasted work.

So we can switch to the inverted idea: build an inverted table from items to users, saving for each item the list of users who have acted on it, like this:

[Figure: item-to-user inverted table]
Now scan the inverted table once. For every pair of users that appear in the same item's user list, add one to the count for that pair. This gives, for all user pairs, the number of items both have acted on, which is the numerator of the cosine similarity.

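import math

# movie_user: {movie_id: set of users who acted on the movie} (the inverted table)
# trainSet: {user: movies the user acted on}
user_sim_matrix = {}
for movie, users in movie_user.items():
    for u in users:           # visit every ordered pair of users in this movie's list
        for v in users:
            if u == v:
                continue
            user_sim_matrix.setdefault(u, {})      # nested dict: u -> {v: count}
            user_sim_matrix[u].setdefault(v, 0)
            user_sim_matrix[u][v] += 1     # movies both users acted on: the numerator of the cosine similarity

# Compute the similarity between users
for u, related_users in user_sim_matrix.items():
    for v, count in related_users.items():    # v is a related user, count is the number of movies both acted on
        user_sim_matrix[u][v] = count / math.sqrt(len(trainSet[u]) * len(trainSet[v]))   # len(...) is the number of movies the user acted on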
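For a self-contained run, the sketch below first builds the item-to-user inverted table from a user-to-item table and then applies the same counting idea. The click data for A, B, C, and D is hypothetical (the original figure is not available), and the names `item_user`, `co_click`, and `user_sim` are chosen for illustration.

```python
import math
from collections import defaultdict

# Hypothetical click data: user -> set of clicked items.
user_item = {
    "A": {"a", "b", "d"},
    "B": {"a", "c"},
    "C": {"b", "e"},
    "D": {"c", "d", "e"},
}

# Build the inverted table: item -> set of users who clicked it.
item_user = defaultdict(set)
for user, items in user_item.items():
    for item in items:
        item_user[item].add(user)

# Scan the inverted table once and count co-clicked items per user pair.
co_click = defaultdict(lambda: defaultdict(int))
for users in item_user.values():
    for u in users:
        for v in users:
            if u != v:
                co_click[u][v] += 1

# Divide by the geometric mean of the two users' item counts.
user_sim = {
    u: {
        v: count / math.sqrt(len(user_item[u]) * len(user_item[v]))
        for v, count in related.items()
    }
    for u, related in co_click.items()
}

print(round(user_sim["A"]["B"], 3))  # 0.408
```

Pairs with no item in common, such as B and C here, never appear in any item's user list, so no work is spent on them.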

References:


Origin blog.csdn.net/wuzhongqiang/article/details/121593181