[Python Crawler] - Implementation of URL Manager

The role of the URL manager

  • In a Python crawler, the URL manager is the component responsible for managing the URLs involved in the crawling process. Its main tasks are the following:

    • URL deduplication: During crawling, the same URL may be encountered repeatedly; crawling it again wastes time and resources and can produce duplicate data. The URL manager maintains a collection of URLs that have already been crawled, ensuring that each URL is crawled only once.

    • URL scheduling: The crawler needs to decide which URL to crawl next. The URL manager selects the next URL according to a chosen strategy, such as first-in-first-out (FIFO), last-in-first-out (LIFO), or a priority queue (a short sketch of these strategies follows this list).

    • Adding new URLs: When new URLs are parsed out of a fetched page, the URL manager adds them to the collection of URLs to be crawled, so that the crawler can keep exploring new pages.

    • URL status management: The URL manager can record the status of each URL, such as whether it has been crawled, whether the crawl succeeded, and how many times it has failed. This helps with optimization and error handling in subsequent crawls.

    • Data persistence: When the crawler stops, the URL manager can save the crawled URLs so that the previous state can be restored on the next run (a persistence sketch follows the summary below).

  • The URL manager usually consists of two parts: the collection of URLs to be crawled and the collection of URLs already crawled. These two parts work together to ensure that the crawler runs efficiently, never crawls the same URL twice, and schedules URLs according to an appropriate strategy.
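
As an illustration of the scheduling strategies mentioned above, here is a minimal sketch of FIFO, LIFO, and priority-queue ordering using Python's standard library (the URL strings are placeholders):

from collections import deque
import heapq

# FIFO: breadth-first exploration
fifo = deque(["page1", "page2"])
fifo.append("page3")              # enqueue at the right
next_url = fifo.popleft()         # dequeue from the left -> "page1"

# LIFO: depth-first exploration
lifo = ["page1", "page2"]
lifo.append("page3")              # push
next_url = lifo.pop()             # pop -> "page3"

# Priority queue: a lower number means the URL is crawled sooner
pq = []
heapq.heappush(pq, (2, "low-priority-page"))
heapq.heappush(pq, (1, "high-priority-page"))
priority, next_url = heapq.heappop(pq)  # -> (1, "high-priority-page")

FIFO gives breadth-first crawling, LIFO gives depth-first crawling, and a priority queue lets more important pages jump ahead.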

To sum up, the URL manager handles integration, coordination, deduplication, and scheduling in the crawler, helping the crawler obtain the required information more efficiently.
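
Building on the persistence point above, here is a minimal sketch of saving and restoring the two URL collections with JSON. It assumes the new_urls / old_urls attributes of the UrlManager implemented below; the file name url_state.json is just a placeholder:

import json

def save_state(manager, path="url_state.json"):
    # Sets are not JSON-serializable, so convert them to lists first
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"new": list(manager.new_urls), "old": list(manager.old_urls)}, f)

def load_state(manager, path="url_state.json"):
    # Restore both collections from a previous run
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    manager.new_urls = set(state["new"])
    manager.old_urls = set(state["old"])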

Python implementation

class UrlManager:
    """URL manager: tracks URLs waiting to be crawled and URLs already crawled."""

    def __init__(self):
        # Initialize the set of URLs to be crawled and the set of crawled URLs
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        """Add a single new URL."""
        if not url:
            return  # ignore empty URLs
        if url in self.new_urls or url in self.old_urls:
            return  # ignore URLs that are already known
        self.new_urls.add(url)

    def add_new_urls(self, urls):
        """Add a batch of new URLs."""
        if not urls:
            return
        for url in urls:
            self.add_new_url(url)

    def get_url(self):
        """Fetch one URL to crawl and record it as crawled."""
        if self.find_new_url():
            url = self.new_urls.pop()  # set.pop() returns an arbitrary element
            self.old_urls.add(url)
            return url
        return None  # nothing left to crawl

    def find_new_url(self):
        """Return True if there are still URLs waiting to be crawled."""
        return len(self.new_urls) > 0
    

if __name__ == "__main__":
    url_manager = UrlManager()
    url_manager.add_new_url('url1')
    url_manager.add_new_urls(['url1','url2'])  # 'url1' is a duplicate and is ignored
    print("new_urls:{}, old_urls:{}".format(url_manager.new_urls, url_manager.old_urls))

    print("+"*30)
    new_url = url_manager.get_url()
    print("new_urls:{}, old_urls:{}".format(url_manager.new_urls, url_manager.old_urls))

    print("+"*30)
    new_url = url_manager.get_url()
    print("new_urls:{}, old_urls:{}".format(url_manager.new_urls, url_manager.old_urls))

    print("+"*30)
    print(url_manager.find_new_url())


"""
output:
new_urls:{'url2', 'url1'}, old_urls:set()
++++++++++++++++++++++++++++++
new_urls:{'url1'}, old_urls:{'url2'}
++++++++++++++++++++++++++++++
new_urls:set(), old_urls:{'url2', 'url1'}
++++++++++++++++++++++++++++++
False
"""
