Python basics: object-oriented programming (Part 2) — building a search engine

Example Search Engine

  A search engine consists of four parts: a crawler (searcher), an indexer, a retriever, and a user interface.

  The crawler fetches content, which the indexer processes into an index stored in an internal database. A user issues a query through the user interface; the retriever parses the query, searches the index efficiently, and returns the results to the user.

  The following five files serve as the sample corpus to be crawled and searched.

# # 1.txt
# I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.

# # 2.txt
# I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.

# # 3.txt
# I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.

# # 4.txt
# This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .

# # 5.txt
# And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"

Simple search engine

  SearchEngineBase is the base class; it loads the corpus and leaves indexing and searching to its subclasses.
class SearchEngineBase(object):
    def __init__(self):
        pass

    # use the file path as the id, and hand the content to process_corpus
    def add_corpus(self, file_path):
        with open(file_path, 'r') as fin:
            text = fin.read()
        self.process_corpus(file_path, text)

    # index the content under the file path as its id; how the index is stored is up to the subclass
    def process_corpus(self, id, text):
        raise Exception('process_corpus not implemented.')

    # retriever
    def search(self, query):
        raise Exception('search not implemented.')

# polymorphism
def main(search_engine):
    for file_path in ['1.txt', '2.txt', '3.txt', '4.txt', '5.txt']:
        search_engine.add_corpus(file_path)

    while True:
        query = input()
        results = search_engine.search(query)
        print('found {} result(s):'.format(len(results)))
        for result in results:
            print(result)



class SimpleEngine(SearchEngineBase):
    def __init__(self):
        super(SimpleEngine, self).__init__()
        self.__id_to_texts = dict()

    def process_corpus(self, id, text):
        self.__id_to_texts[id] = text

    def search(self, query):
        results = []
        for id, text in self.__id_to_texts.items():
            if query in text:
                results.append(id)
        return results

search_engine = SimpleEngine()
main(search_engine)

########## output ##########
# simple
# found 0 result(s):
# whe
# found 2 result(s):
# 1.txt
# 5.txt
  Drawback: the index stores every document's full text, taking a lot of space, and search only matches a single word or a contiguous phrase; it is powerless against several words scattered at different positions in a document.

Bag of words model (bag of words model)

  The bag-of-words model is one of the simplest models in the NLP field.
  process_corpus calls parse_text_to_words to split each document into a set of words.
  search breaks the query into a set of words as well, matches it against each document's word set, and appends the id of every matching document to the results.
 
import re
class BOWEngine(SearchEngineBase):
    def __init__(self):
        super(BOWEngine, self).__init__()
        self.__id_to_words = dict()

    def process_corpus(self, id, text):
        self.__id_to_words[id] = self.parse_text_to_words(text)

    def search(self, query):
        query_words = self.parse_text_to_words(query)
        results = []
        for id, words in self.__id_to_words.items():
            if self.query_match(query_words, words):
                results.append(id)
        return results
    
    @staticmethod
    def query_match(query_words, words):
        for query_word in query_words:
             if query_word not in words:
                 return False
        return True
        #for query_word in query_words:
        #    return False if query_word not in words else True

        #result = filter(lambda x:x not in words,query_words)
        #return False if (len(list(result)) > 0) else True

    @staticmethod
    def parse_text_to_words(text):
        # use a regular expression to remove punctuation and line breaks
        text = re.sub(r'[^\w ]', ' ', text)
        # lowercase
        text = text.lower()
        # split into a list of words
        word_list = text.split(' ')
        # remove empty strings
        word_list = filter(None, word_list)
        # return the words as a set
        return set(word_list)

search_engine = BOWEngine()
main(search_engine)

########## output ##########
# i have a dream
# found 3 result(s):
# 1.txt
# 2.txt
# 3.txt
# freedom children
# found 1 result(s):
# 5.txt
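Since parse_text_to_words returns a set, query_match can also be written as a single subset test. This is an alternative sketch, not the original code; the document names and word sets below are made up for illustration:

```python
def query_match(query_words, words):
    # every query word must appear in the document's word set,
    # i.e. query_words must be a subset of words
    return query_words <= words

# hypothetical word set of one document
doc_words = {'i', 'have', 'a', 'dream', 'today'}
print(query_match({'a', 'dream'}, doc_words))  # True
print(query_match({'freedom'}, doc_words))     # False
```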

  Drawback: every search still has to scan through all the documents.

Inverted index

  An inverted index stores a word -> [ids] dictionary, mapping each word to the documents that contain it.
import re
class BOWInvertedIndexEngine(SearchEngineBase):
    def __init__(self):
        super(BOWInvertedIndexEngine, self).__init__()
        self.inverted_index = dict()

    # build the index: word -> [ids]
    def process_corpus(self, id, text):
        words = self.parse_text_to_words(text)
        for word in words:
            if word not in self.inverted_index:
                self.inverted_index[word] = []
            self.inverted_index[word].append(id)  # e.g. {'little': ['1.txt', '2.txt'], ...}

    def search(self, query):
        query_words = list(self.parse_text_to_words(query))

        # one cursor per query word into its inverted list
        query_words_index = list()
        for query_word in query_words:
            query_words_index.append(0)

        # if any query word has no inverted list, return immediately
        for query_word in query_words:
            if query_word not in self.inverted_index:
                return []

        result = []
        while True:

            # first, read the element at the current cursor of every inverted list
            current_ids = []

            for idx, query_word in enumerate(query_words):
                current_index = query_words_index[idx]
                current_inverted_list = self.inverted_index[query_word]  # e.g. ['1.txt', '2.txt']

                # one of the inverted lists has been fully traversed; search is over
                if current_index >= len(current_inverted_list):
                    return result

                current_ids.append(current_inverted_list[current_index])

            # then, if all elements of current_ids are equal, every query word appears in that document
            if all(x == current_ids[0] for x in current_ids):
                result.append(current_ids[0])
                query_words_index = [x + 1 for x in query_words_index]
                continue

            # otherwise, advance the cursor pointing at the smallest element
            min_val = min(current_ids)
            min_val_pos = current_ids.index(min_val)
            query_words_index[min_val_pos] += 1

    @staticmethod
    def parse_text_to_words(text):
        # use a regular expression to remove punctuation and line breaks
        text = re.sub(r'[^\w ]', ' ', text)
        # lowercase
        text = text.lower()
        # split into a list of words
        word_list = text.split(' ')
        # remove empty strings
        word_list = filter(None, word_list)
        # return the words as a set
        return set(word_list)

search_engine = BOWInvertedIndexEngine()
main(search_engine)


########## output ##########
# little
# found 2 result(s):
# 1.txt
# 2.txt
# vicious little
# found 1 result(s):
# 2.txt
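The multi-cursor merge in search relies on each inverted list being sorted, which holds here because main adds the corpus files in order 1.txt through 5.txt, so ids are appended in ascending order. The two-list case of the same idea can be sketched as follows (the posting lists below are illustrative):

```python
def intersect_sorted(list_a, list_b):
    # two-pointer merge over sorted posting lists,
    # the same idea as the multi-cursor loop in search()
    i, j, result = 0, 0, []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            # both lists point at the same document: it matches
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1  # advance the cursor at the smaller element
        else:
            j += 1
    return result

# hypothetical posting lists for 'little' and 'vicious'
print(intersect_sorted(['1.txt', '2.txt'], ['2.txt']))  # ['2.txt']
```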
 

LRUCache

  If more than 90% of searches are repeated queries, a cache can improve performance; use the Least Recently Used (LRU) eviction algorithm.
import pylru
class LRUCache(object):
    def __init__(self, size=32):
        self.cache = pylru.lrucache(size)
    
    def has(self, key):
        return key in self.cache
    
    def get(self, key):
        return self.cache[key]
    
    def set(self, key, value):
        self.cache[key] = value

class BOWInvertedIndexEngineWithCache(BOWInvertedIndexEngine, LRUCache):
    def __init__(self):
        super(BOWInvertedIndexEngineWithCache, self).__init__()
        LRUCache.__init__(self)

    
    def search(self, query):
        if self.has(query):
            print('cache hit!')
            return self.get(query)
        
        result = super(BOWInvertedIndexEngineWithCache, self).search(query)
        self.set(query, result)
        
        return result

search_engine = BOWInvertedIndexEngineWithCache()
main(search_engine)

########## output ##########
# little
# found 2 result(s):
# 1.txt
# 2.txt
# little
# cache hit!
# found 2 result(s):
# 1.txt
# 2.txt

  Note that BOWInvertedIndexEngineWithCache inherits from two classes.

  In its constructor, super(BOWInvertedIndexEngineWithCache, self).__init__() initializes the first parent class, BOWInvertedIndexEngine.
  The second parent in the multiple inheritance, LRUCache, is initialized explicitly with LRUCache.__init__(self).

  BOWInvertedIndexEngineWithCache overrides the search function; inside it, the parent class BOWInvertedIndexEngine's search is called with:
  result = super(BOWInvertedIndexEngineWithCache, self).search(query)
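The initialization-plus-override pattern above can be sketched on its own with two small stand-in classes (all names below are illustrative, not from the original code):

```python
class Engine:                       # stands in for BOWInvertedIndexEngine
    def __init__(self):
        self.index = {}

    def search(self, query):
        return 'engine result for ' + query

class Cache:                        # stands in for LRUCache
    def __init__(self):
        self.cache = {}

class CachedEngine(Engine, Cache):  # stands in for BOWInvertedIndexEngineWithCache
    def __init__(self):
        super(CachedEngine, self).__init__()  # initializes Engine, the first parent in the MRO
        Cache.__init__(self)                  # initializes the second parent explicitly

    def search(self, query):
        if query in self.cache:
            return self.cache[query]
        result = super(CachedEngine, self).search(query)  # calls Engine.search
        self.cache[query] = result
        return result

e = CachedEngine()
print(e.search('little'))  # 'engine result for little'
```

Because Engine.__init__ does not itself call super().__init__(), the cooperative super chain stops there, which is why the second parent has to be initialized by name.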

 

reference

  Geek Time "Python Core Technology and Practice" column

  https://time.geekbang.org/column/intro/176


Origin www.cnblogs.com/xiaoguanqiu/p/10984178.html