[Scrapy framework] "Version 2.4.0 source code" request and response (Requests and Responses) detailed articles

Index of all source code analysis articles:

[Scrapy Framework] Version 2.4.0 Source Code: All Configuration Directory Index

Introduction

Scrapy uses Request and Response objects to crawl website data.

A Request object is generated in the spider during crawling and passed through the system to the downloader, which executes the request and returns a Response object; the Response then travels back to the spider that issued the request.

Request objects

class scrapy.http.Request(*args, **kwargs)

Main parameters (a usage sketch follows the list):

url (str): the URL of the request; if the URL is invalid, a ValueError exception is raised

callback: the callback function that will be called with the downloaded response

method: the HTTP method of the request, 'GET' by default

meta: a dict of arbitrary metadata passed along with the request

body: the request body; a str body is encoded using the given encoding (utf-8 by default)

headers: the headers of this request (in practice often replaced, e.g. randomized by a middleware)

cookies: the cookies of this request, usually stored as a dict

encoding: the encoding of this request, utf-8 by default

priority: the priority of this request, used by the scheduler to define the order in which requests are processed

dont_filter: whether this request should bypass the duplicate filter

errback: the function called if an exception is raised while processing the request

flags: flags attached to the request, typically used for logging

cb_kwargs: a dict of arbitrary data that will be passed to the request's callback as keyword arguments
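
The usage sketch mentioned above: a minimal example of constructing a Request that uses several of these parameters. The URL, header, cookie, and meta values are placeholders, and parse_index / handle_error are hypothetical callback names:

import scrapy


class RequestParamsSpider(scrapy.Spider):
    name = "request_params_example"

    def start_requests(self):
        # placeholder URL, header, cookie and meta values
        yield scrapy.Request(
            url="http://www.example.com/index.html",
            callback=self.parse_index,
            method="GET",
            headers={"User-Agent": "Mozilla/5.0"},
            cookies={"currency": "USD"},
            meta={"page_type": "index"},      # arbitrary data carried with the request
            priority=10,                      # higher priority is scheduled earlier
            dont_filter=True,                 # bypass the duplicate filter
            errback=self.handle_error,
            cb_kwargs={"source": "start"},    # passed to the callback as keyword arguments
        )

    def parse_index(self, response, source):
        self.logger.info("Visited %s (source=%s)", response.url, source)

    def handle_error(self, failure):
        self.logger.error(repr(failure))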

Passing additional data to callback functions

The callback of a request is the function that will be called when the response for that request is downloaded:

def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)

Use the Request.cb_kwargs attribute to pass parameters to the second callback:

def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )

Use errback to catch exceptions in request processing:

import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # 200 正常访问
        "http://www.httpbin.org/status/404",    # 404 无页面
        "http://www.httpbin.org/status/500",    # 500 服务器挂了
        "http://www.httpbin.org:12345/",        # 超时访问不到主机端口
        "http://www.httphttpbinbin.org/",       # DNS解析错误
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # ...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # to handle specific errors differently, check the failure type

        if failure.check(HttpError):
            # these exceptions come from the HttpError spider middleware
            # for responses with non-200 status codes
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

Accessing additional data in the errback function

When processing a request fails, you may still need the callback's keyword arguments; they can be accessed in the errback via failure.request.cb_kwargs:

def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             errback=self.errback_page2,
                             cb_kwargs=dict(main_url=response.url))
    yield request

def parse_page2(self, response, main_url):
    pass

def errback_page2(self, failure):
    yield dict(
        main_url=failure.request.cb_kwargs['main_url'],
    )

Request.meta special keys

For more detailed operations, refer to the middleware documentation and practical use cases; only brief explanations are given here (a usage sketch follows the list).

  1. dont_redirect : If the dont_redirect key of Request.meta is set to True, the redirect middleware will ignore the request.

  2. dont_retry : If the dont_retry key of Request.meta is set to True, the retry middleware will ignore the request.

  3. handle_httpstatus_list : If you want to process response codes outside the 200-300 range, you can set the response codes the spider is able to handle using the handle_httpstatus_list spider attribute or the HTTPERROR_ALLOWED_CODES setting.

  4. handle_httpstatus_all : The handle_httpstatus_list key of Request.meta can also be used to specify which response codes are allowed for an individual request. To allow any response code for a request, set the meta key handle_httpstatus_all to True.

  5. dont_merge_cookies : When set to True, the cookies of this request are not merged with the existing session cookies.

  6. cookiejar : Supports keeping multiple cookie sessions per spider via the cookiejar meta key. By default a single cookie jar (session) is used, but you can pass an identifier to use a different one.

  7. dont_cache : Set dont_cache to True to avoid caching the response for this request under any cache policy.

  8. redirect_reasons : The reason for each redirect in redirect_urls can be found in the redirect_reasons Request.meta key.

  9. redirect_urls : The URLs the request passes through (when being redirected) can be found in the redirect_urls Request.meta key.

  10. bindaddress : The outgoing IP address to use when performing the request.

  11. dont_obey_robotstxt : If the dont_obey_robotstxt key of Request.meta is set to True, the middleware will ignore the request even if ROBOTSTXT_OBEY is enabled.

  12. download_timeout : The time (in seconds) the downloader will wait before timing out. See also: DOWNLOAD_TIMEOUT.

  13. download_maxsize : You can use the download_maxsize spider attribute to set the maximum response size per spider, and the download_maxsize Request.meta key to set it per request.

  14. download_latency : The time spent fetching the response since the request was started, i.e. the HTTP message sent over the network. This meta key only becomes available once the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one should be treated as read-only.

  15. download_fail_on_dataloss : Whether or not to fail on broken (data-loss) responses.

  16. proxy : Sets the HTTP proxy to use for this request; the proxy middleware honors the proxy value in Request.meta.

  17. ftp_user : The username to use for FTP connections; when there is no "ftp_user" in Request.meta, the FTP_USER setting is used.

  18. ftp_password : The password to use for FTP connections; when there is no "ftp_password" in Request.meta, the FTP_PASSWORD setting is used.

  19. referrer_policy : The referrer policy to apply when populating the request "Referer" header.

  20. max_retry_times : Use this meta key to set the maximum number of retries per request. When set, the max_retry_times meta key takes precedence over the RETRY_TIMES setting.
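
The usage sketch mentioned above: setting a few of these meta keys on a single request. The URL, proxy address, and values are placeholders, not recommendations:

import scrapy


class MetaKeysSpider(scrapy.Spider):
    name = "meta_keys_example"

    def start_requests(self):
        # placeholder URL and proxy address
        yield scrapy.Request(
            "http://www.example.com/",
            callback=self.parse_page,
            meta={
                "proxy": "http://127.0.0.1:8080",      # route this request through a proxy
                "download_timeout": 10,                 # seconds before the downloader times out
                "max_retry_times": 2,                   # overrides the RETRY_TIMES setting
                "handle_httpstatus_list": [404, 500],   # let the spider handle these codes
                "cookiejar": 1,                         # keep a separate cookie session
            },
        )

    def parse_page(self, response):
        # redirect_urls / redirect_reasons are filled in by the redirect middleware
        self.logger.info("Status %s, redirects: %s",
                         response.status,
                         response.meta.get("redirect_urls", []))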

Stopping the download of a response

import scrapy


class StopSpider(scrapy.Spider):
    name = "stop"
    start_urls = ["https://docs.scrapy.org/en/latest/"]

    @classmethod
    def from_crawler(cls, crawler):
        spider = super().from_crawler(crawler)
        crawler.signals.connect(spider.on_bytes_received, signal=scrapy.signals.bytes_received)
        return spider

    def parse(self, response):
        # 'last_chars' shows that the full response was not downloaded
        yield {"len": len(response.text), "last_chars": response.text[-40:]}

    def on_bytes_received(self, data, request, spider):
        raise scrapy.exceptions.StopDownload(fail=False)

Request subclasses

class scrapy.http.FormRequest(url[, formdata, …])

The FormRequest class extends the base Request with functionality for dealing with HTML forms. Its from_response() classmethod returns a new FormRequest object whose form field values are pre-populated from the form elements found in the given response.

Main parameters of the from_response() classmethod (a simple FormRequest sketch follows the list):

response (Response object): the response containing the HTML form that will be used to pre-populate the form fields

formname (str): if given, the form whose name attribute is set to this value will be used.

formid (str): if given, the form whose id attribute is set to this value will be used.

formxpath (str): if given, the first form matching the XPath will be used.

formcss (str): if given, the first form matching the CSS selector will be used.

formnumber (int): the number of the form to use when the response contains multiple forms. The first one (and also the default) is 0.

formdata (dict): fields to override in the form data. If a field is already present in the response's form elements, its value is overridden by the one passed in this parameter. If a value passed in this parameter is None, the field will not be included in the request, even if it was present in the response's form elements.

clickdata (dict): attributes used to look up the clicked control. If not given, the form data will be submitted simulating a click on the first clickable element. In addition to HTML attributes, the control can be identified by its zero-based index relative to other submittable inputs inside the form via the nr attribute.

dont_click (bool): if True, the form data will be submitted without clicking any element.
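
The simple sketch mentioned above: sending form data directly via HTTP POST with a plain FormRequest. The URL, field values, and after_post callback are placeholders; the from_response() login example follows afterwards.

import scrapy

def start_requests(self):
    # placeholder URL and form field values; sent as application/x-www-form-urlencoded
    yield scrapy.FormRequest(
        url="http://www.example.com/post/action",
        formdata={"name": "John Doe", "age": "27"},
        callback=self.after_post,   # hypothetical callback method
    )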

Use FormRequest.from_response() to simulate a user login via HTTP POST:

import scrapy

def authentication_failed(response):
    # check the contents of the response; return True if login failed, False if it succeeded
    pass

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return

class scrapy.http.JsonRequest(url[, … data, dumps_kwargs])

JsonRequest sets the following headers by default:

{
    'Content-Type': 'application/json',
    'Accept': 'application/json, text/javascript, */*; q=0.01'
}

Parameter Description:

  1. data (object): any JSON-serializable object that will be JSON-encoded and assigned to the request body. If the body argument is provided, this parameter is ignored. If the body argument is not provided and the data argument is, Request.method is automatically set to 'POST'.
  2. dumps_kwargs (dict): parameters passed to the underlying json.dumps() call, which is used to serialize the data into JSON format.

Send a JSON POST request with a JSON payload

from scrapy.http import JsonRequest

data = {
    'name1': 'value1',
    'name2': 'value2',
}
yield JsonRequest(url='http://www.example.com/post/action', data=data)

Response object

class scrapy.http.Response(*args, **kwargs)

A Response object represents an HTTP response, which is usually downloaded by the downloader and then fed to the spider for processing.

Parameter Description (a usage sketch follows the list):

  1. url (str): the URL of the response

  2. status (int): The HTTP status of the response. The default is 200.

  3. headers (dict): The headers of this response. The dict value can be a string (for single-value headers) or a list (for multi-value headers).

  4. body (bytes): The response body. To access the decoded text as a string, use response.text on an encoding-aware Response subclass such as TextResponse.

  5. flags (list): A list containing the initial values of the Response.flags attribute. If given, the list is shallow-copied.

  6. request (scrapy.http.Request): The initial value of the Response.request attribute, i.e. the Request that generated this response.

  7. certificate (twisted.internet.ssl.Certificate): An object representing the server's SSL certificate.

  8. ip_address (ipaddress.IPv4Address or ipaddress.IPv6Address): The IP address of the server the response came from.
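
The usage sketch mentioned above: reading these attributes inside a spider callback (the URL is a placeholder):

import scrapy


class ResponseInfoSpider(scrapy.Spider):
    name = "response_info_example"
    start_urls = ["http://www.example.com/"]   # placeholder URL

    def parse(self, response):
        self.logger.info("URL: %s", response.url)
        self.logger.info("Status: %s", response.status)
        self.logger.info("Content-Type: %s", response.headers.get("Content-Type"))
        self.logger.info("Body length: %d bytes", len(response.body))
        self.logger.info("Generated by request: %s", response.request)
        self.logger.info("Server IP: %s", response.ip_address)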

Response subclass

Below is a list of the available built-in Response subclasses. You can also subclass the Response class to implement your own functionality.

class scrapy.http.TextResponse(url[, encoding[, …]])

The TextResponse object adds encoding capabilities to the base Response class, which is meant to be used only for binary data such as images, sounds, or other media files.

The data processing methods here are the same as those shown for Response above, both operating on strings, so they are not repeated (a short sketch follows).
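
The short sketch mentioned above: the string-oriented processing a TextResponse (e.g. an HtmlResponse) supports in a callback. The selector expressions are placeholders:

def parse(self, response):
    # response is a TextResponse subclass instance, so text, selectors and urljoin are available
    self.logger.info("Detected encoding: %s", response.encoding)
    title = response.css("title::text").get()        # placeholder CSS selector
    first_link = response.xpath("//a/@href").get()   # placeholder XPath selector
    yield {
        "title": title,
        "first_link": response.urljoin(first_link) if first_link else None,
    }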

class scrapy.http.HtmlResponse(url[, …])

The HtmlResponse object is a subclass of TextResponse that adds encoding auto-discovery support by looking into the HTML meta http-equiv attribute.

class scrapy.http.XmlResponse(url[, …])

The XmlResponse object is a subclass of TextResponse that adds encoding auto-discovery support by looking into the XML declaration line.

Origin blog.csdn.net/qq_20288327/article/details/113504317