Search Engine: Introduction to Common Information Retrieval Methods and Implementation of Inverted Index (Python)

1. Information retrieval methods

(1) Linear scan

A computer can retrieve document content in many ways. The most direct is to scan each document from beginning to end and pick out the passages that match the keywords we enter.

This mirrors the way people read, so it is simple to implement and easy to understand.

If you were asked whether the phrase "Talk against Confucianism" appears in "The Romance of the Three Kingdoms", you would probably skim through the full text looking for a match.

A modern computer can scan "The Romance of the Three Kingdoms" for a keyword in very little time.

But what if the target is a collection of world literature, a company's annual financial reports, or the even larger document collections produced by the modern information world?

However powerful computers are, linear scanning is only practical for small amounts of text.
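
As a rough illustration of linear scanning, the sketch below checks every document from start to finish for a keyword. The function name `linear_scan` and the sample documents are made up for this example. Every query re-reads the whole collection, which is why the approach only suits small texts.

```python
def linear_scan(documents, keyword):
    """Return the IDs of documents that contain `keyword`.

    Every document is read from start to end on every query,
    so the cost grows with the total size of the collection.
    """
    hits = []
    for doc_id, text in documents.items():
        if keyword in text:  # substring search over the full text
            hits.append(doc_id)
    return hits


# Hypothetical miniature collection, for illustration only.
documents = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}
print(linear_scan(documents, "july"))  # [2, 3]
```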

(2) Term-document incidence matrix

This led to the term-document incidence matrix: we use idle computing power to build the matrix in advance, before any query arrives.

	Three Kingdoms	Dream of Red Mansions	Outlaws of the Marsh	Journey to the West	The Legend of Wukong
Sun Wukong	0	0	0	1	1
Jia Baoyu	0	1	0	0	0

When we input the keyword Sun Wukong, the two documents "Journey to the West" and "The Legend of Wukong" are returned as results.

This greatly speeds up searching, because answering a query only requires reading one precomputed row instead of scanning the text.

Terms are the unit of indexing and documents are the returned results.

Reading the matrix by row gives, for each term, a vector of the documents that contain it; reading it by column gives, for each document, a vector of the terms it contains.
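
The row and column views can be sketched with a nested list. This is only an illustration built from the example above; the helper names `documents_for` and `terms_for` are invented here.

```python
# Terms and documents taken from the example above.
documents = ["Three Kingdoms", "Dream of Red Mansions", "Outlaws of the Marsh",
             "Journey to the West", "The Legend of Wukong"]
terms = ["Sun Wukong", "Jia Baoyu"]

# matrix[i][j] == 1 means term i occurs in document j.
matrix = [
    [0, 0, 0, 1, 1],  # Sun Wukong
    [0, 1, 0, 0, 0],  # Jia Baoyu
]


def documents_for(term):
    """Read one row: which documents contain the term."""
    row = matrix[terms.index(term)]
    return [doc for doc, flag in zip(documents, row) if flag]


def terms_for(document):
    """Read one column: which terms occur in the document."""
    col = documents.index(document)
    return [term for term, row in zip(terms, matrix) if row[col]]


print(documents_for("Sun Wukong"))
print(terms_for("Dream of Red Mansions"))
```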

Of course, this method also has a major shortcoming. As the example table above shows, zeros make up a very large proportion of the matrix.

Storing them wastes space, and processing them slows down retrieval.

(3) Inverted index

Readers who have studied data structures will quickly think of an optimization: sparse arrays.

Instead of storing the whole matrix, we build a term dictionary and posting lists, which record only the documents in which each term occurs.

Term dictionary	Postings
Sun Wukong	4 -> 5
Jia Baoyu	2

An inverted index has two parts. The dictionary is usually kept in memory, while each posting list, which the dictionary points to, is usually stored on disk.
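
A minimal in-memory sketch of these two parts follows. It assumes that document IDs 1 to 5 follow the column order of the matrix in the previous section, and it keeps the posting lists in memory rather than on disk; `doc_names` and `lookup` are names made up for the illustration.

```python
# Document IDs 1..5 are assumed to follow the column order of the matrix.
doc_names = {
    1: "Three Kingdoms",
    2: "Dream of Red Mansions",
    3: "Outlaws of the Marsh",
    4: "Journey to the West",
    5: "The Legend of Wukong",
}

# Dictionary part: term -> posting list (sorted document IDs).
# Only the documents that contain a term are stored, so no zeros are kept.
inverted_index = {
    "Sun Wukong": [4, 5],
    "Jia Baoyu": [2],
}


def lookup(term):
    """Return the titles of the documents in the term's posting list."""
    return [doc_names[doc_id] for doc_id in inverted_index.get(term, [])]


print(lookup("Sun Wukong"))
print(lookup("Zhuge Liang"))  # term not in the dictionary, so the list is empty
```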

2. Inverted index implementation and common corpus processing methods

(1) Goals

Read a text file, treat each line as a separate document, and tokenize and normalize it.

Pay attention to filtering out punctuation and stop words and to converting everything to lower case.

Build the inverted index (term dictionary plus posting lists).
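
The preprocessing steps listed above can be sketched with the standard library alone. The stop-word set below is a tiny hand-picked stand-in, not the list from the `stop_words` package, and stemming is left out; `normalize` is a name chosen for this example.

```python
import re
import string

# Tiny hand-picked stop-word set, used only for this illustration.
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "and"}


def normalize(line):
    """Turn one line of English text into a list of normalized tokens."""
    # 1. Replace punctuation with spaces (re.escape keeps "]" and "\" literal).
    cleaned = re.sub("[{}]".format(re.escape(string.punctuation)), " ", line)
    # 2. Split on whitespace and 3. lowercase every token.
    tokens = [token.lower() for token in cleaned.split()]
    # 4. Drop stop words.
    return [token for token in tokens if token not in STOP_WORDS]


print(normalize("Increase in home sales, in July!"))
# ['increase', 'home', 'sales', 'july']
```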

(2) Complete code

 import re
 import string
 from stop_words import get_stop_words
 from nltk.stem.porter import PorterStemmer
 
 
 # Remove duplicates from a list, preserving order
 def unique(word_list):
     return list(dict.fromkeys(word_list))
 
 
 # Remove stop words
 def stop_word(token):
     en_stop = get_stop_words('en')
     stopped_tokens = [i for i in token if i not in en_stop]
     return stopped_tokens
 
 
 # Stemming
 def stem_extracting(stopped_tokens):
     p_stemmer = PorterStemmer()
     texts = [p_stemmer.stem(i) for i in stopped_tokens]
     return texts
 
 
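 # The Porter stemmer does not try to restore a word to its canonical dictionary form;
 # it only maps the many variants of a word to one common form. In other words,
 # it does not guarantee the original word: "created" is not necessarily reduced to "create",
 # but "create" and "created" both become "creat".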
 
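 def incidence_matrix(text, docID):
     # Purpose:
     # take a piece of text and its document ID
     # and return its terms, each mapped to [docID]
 
     # 1. Replace English punctuation with spaces
     #    (re.escape keeps characters such as "\" and "]" literal inside the character class)
     punctuation_string = re.escape(string.punctuation)
     lines = re.sub('[{}]'.format(punctuation_string), " ", text)
     # 2. Split the English text into tokens
     lsplit = lines.split()
     # 3. Convert to lower case
     lsplit = [word.lower() for word in lsplit]
     # 4. Remove stop words and stem (disabled; uncomment the next line to enable)
     # lsplit = stem_extracting(stop_word(lsplit))
     # 5. Deduplicate by converting to a dictionary
     lsplit_dic = dict.fromkeys(lsplit)
     # 6. Record the document ID
     for word in lsplit_dic.keys():
         lsplit_dic[word] = [docID]
     return lsplit_dic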
 
 
 def read_file(filename):
     result = {}
     count = 0
     # Open the file read-only; the with statement closes it automatically
     with open(filename, 'r') as file_to_read:
         while True:
             # Read one line at a time
             lines = file_to_read.readline()
             # Stop at end of file (readline() returns an empty string)
             if len(lines) == 0:
                 break
             count = count + 1
             lsplot = incidence_matrix(lines, count)
             result = dic_zip(result, lsplot)
     return result
 
 
 def dic_sort(a_dic):
     b_dic = dict.fromkeys(sorted(a_dic))
     for word in b_dic.keys():
         b_dic[word] = a_dic[word]
     return b_dic
 
 
 # Merge the dictionaries of different documents, combining the postings of identical terms
 def dic_zip(a_dic, b_dic):
     # Merge b_dic into a_dic
     for word in b_dic.keys():
         if a_dic.get(word, None):
             a_dic[word].extend(b_dic[word])
         else:
             a_dic[word] = b_dic[word]
     return a_dic
 
 
 def show_dic(a_dic):
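     # Document frequency can be used for query optimization.
     # Column headers: "词项" = term, "文档频率" = document frequency, "倒排记录" = postings.
     # chr(12288) is a full-width space, used as the fill character for the middle column.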
     tplt = "{0:^10}\t{1:{3}^10}\t{2:^40}"
     print(tplt.format("词项", "文档频率", "倒排记录", chr(12288)))
     for word in a_dic.keys():
         print(tplt.format(word, len(a_dic[word]), str(a_dic[word]), chr(12288)))
 
 
 def main():
     # Read the English text file at filename, treat each line as a separate document,
     # and build the inverted index.
     filename = './Reverse_Index_Word2.txt'
     matrix = dic_sort(read_file(filename=filename))
     show_dic(matrix)
 
 
 if __name__ == '__main__':
     main()

(3) Running results

Larger documents can be processed as well; a small text is used here so that the result is easy to display.

The normalization step (stop-word removal and stemming) is commented out in the code; delete the leading # to enable it when needed.

 # Input document
 new home sales top forecasts
 home sales rise in july
 increase in home sales in july
 july new home sales rise
 
 # Output (columns: term, document frequency, postings)
 "D:\Program Files\Python\python.exe" 
     词项      　 文档频率　　　                  倒排记录                  
 forecasts   　　　　1　　　　　                    [1]                   
    home     　　　　4　　　　　                [1, 2, 3, 4]              
     in      　　　　2　　　　　                   [2, 3]                 
  increase   　　　　1　　　　　                    [3]                   
    july     　　　　3　　　　　                 [2, 3, 4]                
    new      　　　　2　　　　　                   [1, 4]                 
    rise     　　　　2　　　　　                   [2, 4]                 
   sales     　　　　4　　　　　                [1, 2, 3, 4]              
    top      　　　　1　　　　　                    [1]                   
 
 Process finished with exit code 0
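
The code comments note that document frequency can be used for query optimization. One way this might look, sketched here under the assumption that a query is a plain AND of terms, is to intersect the posting lists starting from the rarest term. The posting lists are copied from the output above; `intersect` and `and_query` are names invented for this example and are not part of the program shown earlier.

```python
# Posting lists copied from the run output above.
index = {
    "forecasts": [1],
    "home": [1, 2, 3, 4],
    "in": [2, 3],
    "increase": [3],
    "july": [2, 3, 4],
    "new": [1, 4],
    "rise": [2, 4],
    "sales": [1, 2, 3, 4],
    "top": [1],
}


def intersect(p1, p2):
    """Merge two sorted posting lists, keeping IDs present in both."""
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def and_query(index, terms):
    """Return the IDs of documents that contain every term."""
    # Start with the rarest term so intermediate results stay small.
    postings = sorted((index.get(t, []) for t in terms), key=len)
    if not postings:
        return []
    result = postings[0]
    for plist in postings[1:]:
        result = intersect(result, plist)
    return result


print(and_query(index, ["home", "july", "new"]))  # [4]
```

Only document 4 ("july new home sales rise") contains all three terms. Because "new" has the shortest posting list, the intermediate result never grows beyond two entries.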