Thirteen: Crawler-Scrapy framework (Part 2)

1: Review of the use of each file

1. Use of items

The items file is mainly used to define the data structure that stores the crawled data, which makes it easier to pass data between the spider and the Item Pipeline.

items.py

import scrapy

class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    position = scrapy.Field()
    date = scrapy.Field()
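
For context, a minimal sketch of how a spider might fill and yield this item. The spider name, URL, and XPath selectors below are assumptions for illustration only, not taken from the original post:

import scrapy
from ..items import TencentItem  # relative import, assuming the default project layout

class TencentSpider(scrapy.Spider):
    name = 'tencent'
    start_urls = ['https://careers.tencent.com/']  # placeholder URL

    def parse(self, response):
        # The XPath expressions are hypothetical; adapt them to the real page.
        for row in response.xpath("//div[@class='job']"):
            item = TencentItem()
            item['title'] = row.xpath('./h4/text()').get()
            item['position'] = row.xpath('./p[1]/text()').get()
            item['date'] = row.xpath('./p[2]/text()').get()
            yield item  # handed to the Item Pipeline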

2. Use of pipelines

(1) Introduction to pipelines

The pipeline file pipelines.py is mainly used to process the scraped data. Generally, one class is one pipeline; for example, you can create one pipeline class for storing to MySQL and another for storing to MongoDB. The process_item() method in the pipeline file is the method that actually processes the scraped data.

(2) Commonly used pipeline methods
  1. process_item(self, item, spider): processes the data scraped by the spider. process_item() must return item, because when there are multiple pipelines the return value of this function is handed to the next pipeline for further processing (a minimal sketch follows this list);
  2. open_spider(self, spider): executed only once, when the spider starts; generally used to open a database connection;
  3. close_spider(self, spider): executed only once, when the spider finishes; generally used for cleanup, such as closing the database connection.
(3) Points to note about pipelines
  1. The smaller the value assigned to a pipeline in ITEM_PIPELINES, the higher its priority.
  2. The method name process_item cannot be changed to anything else.
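
Putting the methods above together, here is a minimal pipeline sketch that writes items to a JSON-lines file. The file name data.jsonl and the module path in ITEM_PIPELINES are assumptions; swap in MySQL or MongoDB logic as needed:

# pipelines.py
import json

class TencentPipeline:
    def open_spider(self, spider):
        # Runs once when the spider starts; open resources here (file, DB connection, ...).
        self.f = open('data.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Process one scraped item; must return it so the next pipeline can keep working on it.
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Runs once when the spider finishes; release resources here.
        self.f.close()

Remember to enable it in settings.py, e.g. ITEM_PIPELINES = {'tencent.pipelines.TencentPipeline': 300} (the module path is an assumption).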

2: Workflow review

1. How to handle page turning
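
Page turning is usually handled by extracting the next-page link in parse() and yielding a new scrapy.Request for it. A minimal sketch, assuming a hypothetical paginated listing and next-page selector:

import scrapy

class JobSpider(scrapy.Spider):
    name = 'job'
    start_urls = ['https://example.com/jobs?page=1']  # hypothetical listing page

    def parse(self, response):
        # ... extract items from the current page here ...

        # Follow the "next page" link if one exists; callback points back to parse().
        next_url = response.xpath("//a[@class='next']/@href").get()
        if next_url:
            yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse)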

2. scrapy.Request knowledge points

scrapy.Request(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, flags=None)

Commonly used parameters:
callback: specifies which parsing function the requested URL is handed to;
meta: passes data between different parsing functions; by default meta also carries some information of its own, such as the download delay and the request depth;
dont_filter: tells Scrapy's deduplication not to filter the current URL; Scrapy deduplicates URLs by default, so this is important for URLs that need to be requested repeatedly.
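
A minimal sketch of callback and meta in use, assuming a hypothetical listing page whose links lead to detail pages:

import scrapy

class MetaDemoSpider(scrapy.Spider):
    name = 'meta_demo'  # hypothetical spider, for illustration only
    start_urls = ['https://example.com/list']

    def parse(self, response):
        for href in response.xpath("//a[@class='item']/@href").getall():
            # callback: hand the detail page to a different parsing function.
            # meta: carry data from this parse function to the next one.
            yield scrapy.Request(
                url=response.urljoin(href),
                callback=self.parse_detail,
                meta={'list_page': response.url},
            )

    def parse_detail(self, response):
        # Data placed in meta on the request side is available on the response side.
        yield {
            'from_list_page': response.meta['list_page'],
            'title': response.xpath('//h1/text()').get(),
        }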

Three: Scrapy downloader middleware

Downloader middleware is a mechanism Scrapy provides for modifying and extending Requests and Responses during the crawl. Usage:

  • Write a downloader middleware just as we wrote a pipeline: define a class, then enable it in settings under DOWNLOADER_MIDDLEWARES.
  • Processing requests and processing responses correspond to two methods:
process_request(self, request, spider):
    called when each request passes through the downloader middleware

process_response(self, request, response, spider):
    called when the downloader finishes the HTTP request and passes the response to the engine

process_request() is called when each Request object passes through the downloader middleware; the higher a middleware's priority, the earlier it is called. This method must return one of the following: None, a Response object, a Request object, or raise an IgnoreRequest exception.

  • Return None: Scrapy continues to call the process_request methods of the other middlewares;
  • Return a Response object: Scrapy will not call the process_request methods of other middlewares and will not start a download; it returns the Response object directly.
  • Return a Request object: Scrapy will not call the process_request() methods of other middlewares; the request is placed in the scheduler to be scheduled for download.
  • If this method raises an exception, the process_exception method is called.

process_response(request, response, spider)
is called when each Response passes through the downloader middleware; the higher a middleware's priority, the later it is called, which is the opposite of process_request(). This method must return one of the following: a Response object, a Request object, or raise an IgnoreRequest exception.

  • Return a Response object: Scrapy continues to call the process_response methods of the other middlewares;
  • Return a Request object: the middleware chain stops and the request is placed in the scheduler to be scheduled for download;
  • Raise IgnoreRequest: the Request.errback function is called to handle it; if it is not handled, the request is ignored and nothing is written to the log.
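
As a hedged illustration of these return values, a middleware could return a new Request from process_response() to retry on a bad status code. The 403 check and the dont_filter choice are assumptions, not part of the original post:

class RetryOn403Middleware:
    def process_response(self, request, response, spider):
        if response.status == 403:
            # Returning a Request stops the middleware chain and re-schedules the download.
            return request.replace(dont_filter=True)
        # Returning the Response lets the remaining middlewares (and the spider) see it.
        return response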

1. Middleware workflow

Download middleware works as follows:

  1. When the Scrapy engine receives a request that needs to be downloaded, it sends the request to the downloader middleware.
  2. After the downloader middleware receives the request, it can modify the request, such as adding headers, setting a proxy, etc.
  3. The modified request is sent to the target server, and the target server returns response data.
  4. After the downloader middleware receives the response data, it can modify the response, such as decrypting, decompressing, or changing the encoding.
  5. The modified response is returned to the Scrapy engine, and the engine continues to process the response data.
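
Step 2 mentions adding headers or a proxy. A minimal sketch of a proxy-setting middleware, where the proxy address is a placeholder and whether you need a proxy at all depends on your setup:

class ProxyDownloaderMiddleware:
    def process_request(self, request, spider):
        # Attach a proxy before the request reaches the downloader.
        request.meta['proxy'] = 'http://127.0.0.1:7890'  # placeholder proxy address
        return None  # continue through the remaining middlewares and the downloader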

2. Setting a random UA through middleware

When a crawler accesses pages frequently with the same request headers, the server can easily detect it and block that request header. Therefore, we randomly change the request header before each request so the crawler is harder to detect. This can be implemented in the downloader middleware: before the request is sent to the server, a request header is picked at random, which avoids always using the same one.

Requirement: set a random UA through middleware.
Key points of the middleware:
# Intercept every request
    def process_request(self, request, spider):
        # request is the request object; spider points to the current spider object
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        #   (keep passing it along toward the downloader to be handled there)
        # - or return a Response object
        #   (terminate the current flow and return this response straight to
        #   the spider through the engine)
        # - or return a Request object
        #   (terminate the current flow and hand the request back to the
        #   scheduler; in most cases this is used to swap in a new request)
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        #   (if no middleware handles the exception, the request is simply
        #   ignored and no error is logged)
        return None


# Custom downloader middleware
# Import the library that provides random User-Agents
from fake_useragent import UserAgent

class UADownloaderMiddleware:
    def process_request(self, request, spider):
        # Pick a random User-Agent and attach it to the request before it is downloaded
        ua = UserAgent()
        user_agent = ua.random
        request.headers['User-Agent'] = user_agent

Note: enable this middleware in settings:
DOWNLOADER_MIDDLEWARES = {
   # 'mw.middlewares.MwDownloaderMiddleware': 543,
   'mw.middlewares.UADownloaderMiddleware': 543,
}

爬虫程序.py (the spider file):
class UaSpider(scrapy.Spider):
    name = 'ua'
    allowed_domains = ['httpbin.org']
    start_urls = ['https://httpbin.org/user-agent']

    def parse(self, response):
        print(response.text)
        # dont_filter=True disables Scrapy's automatic URL deduplication,
        # so the same URL can be requested repeatedly
        yield scrapy.Request(url=self.start_urls[0],
                             callback=self.parse,
                             dont_filter=True)

3. Downloading images with Scrapy

Scrapy provides reusable item pipelines for downloading the files attached to items. These pipelines share some common methods and structure; for images, you would generally use the Images Pipeline.

Built-in way to download images:

Steps to use the Images Pipeline to download files:

  • Define an Item and give it two attributes: image_urls and images. image_urls stores the links to the files to be downloaded and must be given as a list of URLs (a sketch follows this list).
  • When the file download finishes, information about the download is stored in the item's images attribute, such as the download path, the original URL, and the image checksum.
  • Configure IMAGES_STORE in settings.py; this setting determines the download path for the files.
  • Enable the pipeline: add scrapy.pipelines.images.ImagesPipeline: 1 to ITEM_PIPELINES.
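
A minimal sketch of the pieces above. The item name, XPath, and the ./images path are assumptions for illustration; only image_urls, images, IMAGES_STORE, and ImagesPipeline come from Scrapy itself:

# items.py
import scrapy

class ImgItem(scrapy.Item):
    image_urls = scrapy.Field()  # list of file URLs to download
    images = scrapy.Field()      # filled in by the Images Pipeline after the download

# settings.py (the ./images path is an assumption)
# IMAGES_STORE = './images'
# ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

# inside the spider's parse() method:
#     item = ImgItem()
#     item['image_urls'] = response.xpath('//img/@src').getall()
#     yield item

Note that the Images Pipeline relies on the Pillow library, so make sure it is installed.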
