Example Search Engine
A search engine consists of four parts: a searcher, an indexer, a retriever, and a user interface.
The searcher is a crawler (spider) that scrapes content from the web. The indexer processes the crawled content and stores it as an index in an internal database. When a user issues a query through the user interface, the retriever parses the query, searches the index, and returns the results to the user.
The following five files are the sample corpus that the crawler has collected.
# 1.txt
I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.

# 2.txt
I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.

# 3.txt
I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

# 4.txt
This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .

# 5.txt
And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"
Simple search engine
The SearchEngineBase base class loads the corpus and defines the interfaces that concrete engines must implement.
class SearchEngineBase(object):
    def __init__(self):
        pass

    # Read a file and hand its content to process_corpus, using the file path as the id
    def add_corpus(self, file_path):
        with open(file_path, 'r') as fin:
            text = fin.read()
        self.process_corpus(file_path, text)

    # Indexer interface: index the content, with the file path as the id
    def process_corpus(self, id, text):
        raise Exception('process_corpus not implemented.')

    # Retriever interface: answer a query
    def search(self, query):
        raise Exception('search not implemented.')

# Polymorphism: main() works with any SearchEngineBase subclass
def main(search_engine):
    for file_path in ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt']:
        search_engine.add_corpus(file_path)
    while True:
        query = input()
        results = search_engine.search(query)
        print('found {} result(s):'.format(len(results)))
        for result in results:
            print(result)

class SimpleEngine(SearchEngineBase):
    def __init__(self):
        super(SimpleEngine, self).__init__()
        self.__id_to_texts = dict()

    def process_corpus(self, id, text):
        self.__id_to_texts[id] = text

    def search(self, query):
        results = []
        for id, text in self.__id_to_texts.items():
            if query in text:
                results.append(id)
        return results

search_engine = SimpleEngine()
main(search_engine)

########## output ##########
# simple
# found 0 result(s):
# whe
# found 2 result(s):
# 1.txt
# 5.txt
Disadvantages: every query scans the full text of every document, which costs a lot of time and space, and it can only match the query as one contiguous substring; it is powerless when the query words are scattered at different positions in a document.
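The second limitation is easy to demonstrate: plain substring matching only succeeds when the query words appear next to each other, so a query whose words both occur, just not adjacently, finds nothing. A minimal sketch (the sentence is taken from 2.txt above):

```python
text = ("I have a dream that one day down in Alabama, with its vicious "
        "racists, one day right there in Alabama little black boys and "
        "black girls will be able to join hands with little white boys "
        "and white girls as sisters and brothers.")

# A contiguous phrase is found by substring search...
print('vicious racists' in text)   # True

# ...but two words that both occur, just not adjacently, are not.
print('vicious little' in text)    # False
```

This is exactly the gap that the bag-of-words model below closes.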
Bag-of-words model
The bag-of-words model, one of the simplest models in NLP, treats a document as an unordered collection of the words it contains, ignoring grammar and word order.
process_corpus calls parse_text_to_words to split each document into words and stores them in a set.
search breaks the query into a set of words as well, then matches it against each document's word set; every id whose document contains all the query words is added to the results.
import re

class BOWEngine(SearchEngineBase):
    def __init__(self):
        super(BOWEngine, self).__init__()
        self.__id_to_words = dict()

    def process_corpus(self, id, text):
        self.__id_to_words[id] = self.parse_text_to_words(text)

    def search(self, query):
        query_words = self.parse_text_to_words(query)
        results = []
        for id, words in self.__id_to_words.items():
            if self.query_match(query_words, words):
                results.append(id)
        return results

    @staticmethod
    def query_match(query_words, words):
        # Every query word must appear in the document's word set
        for query_word in query_words:
            if query_word not in words:
                return False
        return True

    @staticmethod
    def parse_text_to_words(text):
        # Remove punctuation and newlines with a regular expression
        text = re.sub(r'[^\w ]', ' ', text)
        # Convert to lowercase
        text = text.lower()
        # Split into a list of words
        word_list = text.split(' ')
        # Drop empty strings
        word_list = filter(None, word_list)
        # Return the words as a set
        return set(word_list)

search_engine = BOWEngine()
main(search_engine)

########## output ##########
# i have a dream
# found 3 result(s):
# 1.txt
# 2.txt
# 3.txt
# freedom children
# found 1 result(s):
# 5.txt
Disadvantage: every query still has to iterate over all documents.
Inverted index
An inverted index reverses the mapping: instead of id -> words, it stores word -> [ids], i.e. for each word, the list of documents that contain it. A query then only needs to look at the postings lists of its own words rather than at every document.
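The search method below intersects the postings lists by keeping one cursor per list and always advancing the cursor that points at the smallest id; because ids are appended in corpus order, each list is already sorted, so every id is visited at most once. The core idea in isolation, for just two lists (the function name is mine, not part of the engine):

```python
def intersect(list_a, list_b):
    """Intersect two sorted postings lists with two cursors."""
    i, j, result = 0, 0, []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:       # same id in both lists: a match
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:      # advance the cursor at the smaller id
            i += 1
        else:
            j += 1
    return result

print(intersect(['1.txt', '2.txt', '5.txt'], ['2.txt', '3.txt', '5.txt']))
# ['2.txt', '5.txt']
```

The engine below generalizes this to any number of query words by tracking one index per postings list.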
import re

class BOWInvertedIndexEngine(SearchEngineBase):
    def __init__(self):
        super(BOWInvertedIndexEngine, self).__init__()
        self.inverted_index = dict()

    # Build the index: word -> [ids]
    def process_corpus(self, id, text):
        words = self.parse_text_to_words(text)
        for word in words:
            if word not in self.inverted_index:
                self.inverted_index[word] = []
            self.inverted_index[word].append(id)  # {'little': ['1.txt', '2.txt'], ...}

    def search(self, query):
        query_words = list(self.parse_text_to_words(query))
        query_words_index = list()
        for query_word in query_words:
            query_words_index.append(0)

        # If any query word has an empty inverted list, return immediately
        for query_word in query_words:
            if query_word not in self.inverted_index:
                return []

        result = []
        while True:
            # First, collect the element each inverted list's cursor currently points at
            current_ids = []
            for idx, query_word in enumerate(query_words):
                current_index = query_words_index[idx]
                current_inverted_list = self.inverted_index[query_word]  # e.g. ['1.txt', '2.txt']

                # One of the inverted lists is exhausted; the search is over
                if current_index >= len(current_inverted_list):
                    return result
                current_ids.append(current_inverted_list[current_index])

            # If all elements of current_ids are equal, that document contains every query word
            if all(x == current_ids[0] for x in current_ids):
                result.append(current_ids[0])
                query_words_index = [x + 1 for x in query_words_index]
                continue

            # Otherwise, advance the cursor at the smallest element
            min_val = min(current_ids)
            min_val_pos = current_ids.index(min_val)
            query_words_index[min_val_pos] += 1

    @staticmethod
    def parse_text_to_words(text):
        # Remove punctuation and newlines with a regular expression
        text = re.sub(r'[^\w ]', ' ', text)
        # Convert to lowercase
        text = text.lower()
        # Split into a list of words
        word_list = text.split(' ')
        # Drop empty strings
        word_list = filter(None, word_list)
        # Return the words as a set
        return set(word_list)

search_engine = BOWInvertedIndexEngine()
main(search_engine)

########## output ##########
# little
# found 2 result(s):
# 1.txt
# 2.txt
# vicious little
# found 1 result(s):
# 2.txt
LRUCache
If more than 90% of queries are repeats, a cache can greatly improve performance. Here we use an LRU (Least Recently Used) cache, which evicts the entry that has gone unused the longest.
import pylru

class LRUCache(object):
    def __init__(self, size=32):
        self.cache = pylru.lrucache(size)

    def has(self, key):
        return key in self.cache

    def get(self, key):
        return self.cache[key]

    def set(self, key, value):
        self.cache[key] = value

class BOWInvertedIndexEngineWithCache(BOWInvertedIndexEngine, LRUCache):
    def __init__(self):
        super(BOWInvertedIndexEngineWithCache, self).__init__()
        LRUCache.__init__(self)

    def search(self, query):
        # Return the cached result on a hit
        if self.has(query):
            print('cache hit!')
            return self.get(query)

        result = super(BOWInvertedIndexEngineWithCache, self).search(query)
        self.set(query, result)
        return result

search_engine = BOWInvertedIndexEngineWithCache()
main(search_engine)

########## output ##########
# little
# found 2 result(s):
# 1.txt
# 2.txt
# little
# cache hit!
# found 2 result(s):
# 1.txt
# 2.txt
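pylru is a third-party package. If it is not available, an equivalent cache with the same has/get/set interface can be sketched with the standard library's collections.OrderedDict (this class is my stand-in, not part of pylru):

```python
from collections import OrderedDict

class OrderedDictLRUCache(object):
    """A small LRU cache built on OrderedDict; same interface as LRUCache above."""
    def __init__(self, size=32):
        self.size = size
        self.cache = OrderedDict()

    def has(self, key):
        return key in self.cache

    def get(self, key):
        self.cache.move_to_end(key)       # mark as most recently used
        return self.cache[key]

    def set(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.size:   # evict the least recently used entry
            self.cache.popitem(last=False)
```

OrderedDict keeps keys in insertion order, so the least recently used key is always at the front once reads call move_to_end.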
Note that BOWInvertedIndexEngineWithCache inherits from two classes: BOWInvertedIndexEngine and LRUCache.
In its constructor, super(BOWInvertedIndexEngineWithCache, self).__init__() initializes the first parent class, BOWInvertedIndexEngine.
For the second parent class under multiple inheritance, LRUCache.__init__(self) is called explicitly.
BOWInvertedIndexEngineWithCache overrides search; to call the parent class BOWInvertedIndexEngine's search inside it, it uses:
result = super(BOWInvertedIndexEngineWithCache, self).search(query)
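Which parent these calls reach is determined by Python's method resolution order (MRO). A toy sketch of the same two-base pattern, with illustrative class names of my own, shows why the explicit call to the second base class is needed:

```python
class Engine(object):
    def __init__(self):
        self.index_ready = True

class Cache(object):
    def __init__(self):
        self.cache = {}

class CachedEngine(Engine, Cache):
    def __init__(self):
        super(CachedEngine, self).__init__()  # reaches Engine.__init__, the next class in the MRO
        Cache.__init__(self)                  # Engine.__init__ does not chain on, so call Cache explicitly

e = CachedEngine()
print([cls.__name__ for cls in CachedEngine.__mro__])
# ['CachedEngine', 'Engine', 'Cache', 'object']
```

Because Engine.__init__ never calls super().__init__(), the super() chain stops there; without the explicit Cache.__init__(self), the instance would have no cache attribute.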