[Scrapy Framework] Version 2.4.0 Source Code: Spiders (Crawler Files) Explained in Detail

Index of all source code analysis articles in this series:

[Scrapy Framework] Version 2.4.0 Source Code: All Configuration Directory Index

Introduction

The spiders folder contains the individual .py crawler files to be executed. The scripts in this folder are launched from the command line and implement the actual data-scraping logic.

Spider attributes and methods explained

Command to create a spider:

scrapy genspider xxxxx xxxxxx.com
  1. name
    The spider name, generated by the command line when the crawler file is created. The spider is executed under this name, so it normally does not need to be modified, and it must be unique within a crawler project.
name = "xxxxx"  # this corresponds to the third part of the command line above
  2. allowed_domains
    Optional list of domains the spider is allowed to crawl. If it is set, URLs outside these domains will not be processed.
allowed_domains = []
allowed_domains = ["https://xxxxxx.com",] # 这里对应的是命令行的最后一部分
  3. start_urls
    The list of URLs to crawl; the start_requests method of the base Spider class iterates over it to build the initial requests.
    Later, in the articles on crawler management, we will cover how to override this part for management purposes.
start_urls = [
	'http://aaaa/',
	'http://bbbb/',
	'http://cccc/',
]
  4. custom_settings
    Spider-specific configuration. If defined, it overrides the project's global settings, and it must be declared as a class attribute. It is recommended to keep configuration in settings.py for easier management rather than modifying it here; see the sketch below.
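A minimal sketch, assuming the two setting names below suit your project; custom_settings must be a class attribute so it is read before the spider is instantiated:

import scrapy

class XxxxxSpider(scrapy.Spider):
    name = "xxxxx"

    # Spider-level overrides; these take precedence over settings.py for this spider only
    custom_settings = {
        "DOWNLOAD_DELAY": 1,        # assumed: wait 1 second between requests
        "CONCURRENT_REQUESTS": 8,   # assumed: limit concurrency for this spider
    }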

  5. crawler
    The Crawler object the spider is bound to, used to access the configuration in settings, such as middlewares, pipelines, and so on.
    No modification is required by default.
  6. settings
    The Settings instance, used to read the unified project configuration; see the sketch below.
    No modification is required by default.
  7. logger
    Used to emit project log messages during the scraping process; see the sketch below.
    No modification is required by default.
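A minimal sketch of emitting log messages from a callback:

def parse(self, response):
    self.logger.info("Crawled %s (status %s)", response.url, response.status)
    if not response.css("div.item"):   # placeholder selector
        self.logger.warning("No items found on %s", response.url)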
  8. from_crawler
    The class method Scrapy uses to create the spider instance.
    No modification is required by default.
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = cls(*args, **kwargs)
    spider._set_crawler(crawler)
    return spider
# Create an instance object
def __init__(self):
    # Instantiate a browser object (instantiated only once);
    # chrm and options are assumed to be a driver path and FirefoxOptions defined elsewhere
    self.bro = webdriver.Firefox(executable_path=chrm, options=options)
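If you do override from_crawler, a common pattern is to call the parent implementation and then read extra configuration from the crawler; a minimal sketch, where MY_CHANNEL is a hypothetical setting name:

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    # Hypothetical setting read once, when the spider is created
    spider.my_channel = crawler.settings.get("MY_CHANNEL")
    return spider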
  9. start_requests()
    Generates the initial requests. By default, the URLs in the start_urls list are used to construct the requests, which are GET requests.
    If you need to use the POST method or pass parameters, you must override the start_requests method; see the sketch below.
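A minimal sketch of overriding start_requests to send a POST request with scrapy.FormRequest; the URL and form fields are placeholders:

import scrapy

class LoginSpider(scrapy.Spider):
    name = "login_example"   # placeholder spider name

    def start_requests(self):
        # Send a POST instead of the default GET built from start_urls
        yield scrapy.FormRequest(
            url="http://xxxxxx.com/login",            # placeholder URL
            formdata={"user": "xxx", "pwd": "xxx"},   # placeholder form parameters
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Login response status: %s", response.status)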

  10. parse
    The default callback Scrapy uses to process downloaded responses. Callback methods with other names can also be defined for data scraping, as in the example below.
    def start_requests(self):
        parse_list = [
            self.parse1,
            self.parse2,
        ]

        # Non-API interface method: start_urls is a list of URL lists,
        # and channel_name_list / type_id_list mirror its structure
        for list_num in range(len(self.start_urls)):
            for url_num in range(len(self.start_urls[list_num])):
                yield scrapy.Request(
                    url=self.start_urls[list_num][url_num],
                    meta={'channel_name': self.channel_name_list[list_num][url_num],
                          'type_id': self.type_id_list[list_num][url_num]},
                    callback=parse_list[list_num])

    def parse1(self, response):
        ...

    def parse2(self, response):
        ...
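For completeness, a minimal sketch of what one of these callbacks might look like; the CSS selectors and item fields are placeholders:

    def parse1(self, response):
        # Iterate over result rows and yield plain-dict items
        for row in response.css("div.item"):          # placeholder selector
            yield {
                "channel_name": response.meta.get("channel_name"),
                "title": row.css("a::text").get(),
                "link": response.urljoin(row.css("a::attr(href)").get()),
            }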

  11. closed
    Method called when the spider closes.
    # The browser must be closed after the entire crawl has finished
    def closed(self, spider):
        print('Spider finished')
        self.bro.quit()

Origin: blog.csdn.net/qq_20288327/article/details/113480328