Scrapy study notes: adding headers to CrawlSpider requests

How to add headers and cookies to the requests a CrawlSpider generates after its rules extract links

The two spider classes used most often in Scrapy are Spider and CrawlSpider.

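When a site has to be crawled automatically, whether or not its URLs follow a regular pattern, CrawlSpider is the usual choice. It defines a set of rules (Rule objects) that extract URLs from each page, and it automatically issues requests to follow them.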

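Rule parameters:

class scrapy.contrib.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

(In Scrapy 1.0 and later the class is imported from scrapy.spiders instead of scrapy.contrib.spiders.)

link_extractor	A Link Extractor object. It defines how links are extracted from each crawled page.
follow	A bool that specifies whether links extracted from a response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.
process_links	A callable, or a string (in which case the spider method with that name is called). It is called with the list of links obtained from link_extractor. It is mainly used for filtering.
process_request	A callable, or a string (in which case the spider method with that name is called). It is called for every request extracted by this rule and must return a request or None (returning None filters the request out). Since Scrapy 2.0 the callable also receives the response the request came from as a second argument.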
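As a small illustration of the process_links hook described above, here is a minimal sketch of a filter function. The function name drop_question_links, the FakeLink stand-in, and the example URLs are assumptions made for this note; they are not part of Scrapy or of the original spider. The only thing the sketch relies on is that each extracted link exposes a url attribute, as scrapy.link.Link does.

```python
from collections import namedtuple

# Stand-in for scrapy.link.Link so the sketch runs without Scrapy installed;
# real Link objects expose the same `url` attribute.
FakeLink = namedtuple("FakeLink", ["url"])


def drop_question_links(links):
    """process_links hook: takes the extracted links, returns the ones to follow."""
    return [link for link in links if "/question/" not in link.url]


# Hypothetical URLs, used only to exercise the filter
links = [
    FakeLink("https://www.zhihu.com/topic/19550517"),
    FakeLink("https://www.zhihu.com/question/12345"),
]
kept = drop_question_links(links)
print([link.url for link in kept])
```

Because process_links accepts a plain callable, a function like this could be passed directly, for example Rule(LinkExtractor(...), process_links=drop_question_links), rather than being referenced by method name.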

To modify the headers of these requests or attach cookies to them, use process_request:

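from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TestSpider(CrawlSpider):
    name = "test"
    start_urls = [
        "https://www.zhihu.com",
    ]
    # Headers applied to every request produced by the rule below
    myheaders = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip,deflate",
        "Accept-Language": "en-US,en;q=0.8,zh-TW;q=0.6,zh;q=0.4",
        "Connection": "keep-alive",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
        "Referer": "https://www.zhihu.com",
    }
    rules = [
        Rule(
            LinkExtractor(allow=r"/topic/\d+$"),  # raw string so \d is not treated as a string escape
            process_request="request_tagPage",
            callback="parse_tagPage",
            follow=True,
        ),
    ]

    def request_tagPage(self, request, response=None):
        # Scrapy 2.0+ also passes the response; the default keeps older versions working.
        # replace() returns a copy of the request with the given attributes swapped in.
        newRequest = request.replace(headers=self.myheaders)
        # The cookiejar meta key tells the cookies middleware which cookie session to use.
        newRequest.meta.update(cookiejar=1)
        return newRequest

    def parse_tagPage(self, response):
        # Placeholder callback (not in the original note): record the visited URL.
        yield {"url": response.url}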

Reference:

https://stackoverflow.com/questions/38280133/scrapy-rules-not-working-when-process-request-and-callback-parameter-are-set/38347983#38347983
