[Scraping] Python Scrapy Basic Concepts: Requests and Responses

[Original article] https://doc.scrapy.org/en/latest/topics/request-response.html

Scrapy uses Request and Response objects for crawling web sites.

Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object that travels back to the spider that issued the request.

Both the Request and Response classes have subclasses which add functionality not required in the base classes. These are described below in Request subclasses and Response subclasses.

Request objects

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags])

A Request object represents an HTTP request, which is usually generated in the Spider and executed by the Downloader, thus generating a Response.

Parameters:
  • url (string) – the URL of this request
  • callback (callable) – the function that will be called with the response of this request (once it's downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn't specify a callback, the spider's parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
  • method (string) – the HTTP method of this request. Defaults to 'GET'.
  • meta (dict) – the initial values for the Request.meta attribute. If given, the dict passed in this parameter will be shallow copied.
  • body (str or unicode) – the request body. If a unicode is passed, it is encoded to str using the encoding passed (which defaults to utf-8). If body is not given, an empty string is stored. Regardless of the type of this argument, the final value stored will be a str (never unicode or None).
  • headers (dict) – the headers of this request. The dict values can be strings (for single valued headers) or lists (for multi-valued headers). If None is passed as value, the HTTP header will not be sent at all.
  • cookies (dict or list) – the request cookies. These can be sent in two forms.
    1. Using a dict:
      request_with_cookies = Request(url="http://www.example.com",
                                     cookies={'currency': 'USD', 'country': 'UY'})
      
    2. Using a list of dicts:
      request_with_cookies = Request(url="http://www.example.com",
                                     cookies=[{'name': 'currency',
                                              'value': 'USD',
                                              'domain': 'example.com',
                                              'path': '/currency'}])
      

    The latter form allows for customizing the domain and path attributes of the cookie. This is only useful if the cookies are saved for later requests.

    When some site returns cookies (in a response) those are stored in the cookies for that domain and will be sent again in future requests. That's the typical behaviour of any regular web browser. However, if, for some reason, you want to avoid merging with existing cookies, you can instruct Scrapy to do so by setting the dont_merge_cookies key to True in the Request.meta.

    Example of request without merging cookies:

    request_with_cookies = Request(url="http://www.example.com",
                                   cookies={'currency': 'USD', 'country': 'UY'},
                                   meta={'dont_merge_cookies': True})
    

    For more info see CookiesMiddleware.

  • encoding (string) – the encoding of this request (defaults to 'utf-8'). This encoding will be used to percent-encode the URL and to convert the body to str (if given as unicode).
  • priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order in which requests are processed. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low priority.
  • dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
  • errback (callable) – a function that will be called if any exception was raised while processing the request. This includes pages that fail with 404 HTTP errors and such. It receives a Twisted Failure instance as first parameter. For more information, see Using errbacks to catch exceptions in request processing below.
  • flags (list) – flags sent to the request, which can be used for logging or similar purposes.
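
For illustration, here's a minimal sketch of a spider constructing a Request with several of these parameters; the URL and spider name are hypothetical:

import scrapy

class ItemSpider(scrapy.Spider):
    name = "item_example"  # hypothetical spider

    def start_requests(self):
        yield scrapy.Request(
            url="http://www.example.com/items",  # hypothetical URL
            callback=self.parse_item,            # receives the downloaded Response
            headers={"Accept-Language": "en"},
            meta={"page": 1},                    # arbitrary metadata, see Request.meta
            priority=10,                         # scheduled before priority-0 requests
            dont_filter=True,                    # bypass the duplicates filter
        )

    def parse_item(self, response):
        self.logger.info("Visited %s", response.url)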

url

A string containing the URL of this request. Keep in mind that this attribute contains the escaped URL, so it can differ from the URL passed in the constructor.

This attribute is read-only. To change the URL of a Request use replace().

method

A string representing the HTTP method in the request. This is guaranteed to be uppercase. Example: "GET", "POST", "PUT", etc.

headers

A dictionary-like object which contains the request headers.

body

A str that contains the request body.

This attribute is read-only. To change the body of a Request use replace().

meta

A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this dict depends on the extensions you have enabled.

See Request.meta special keys for a list of special meta keys recognized by Scrapy.

This dict is shallow copied when the request is cloned using the copy() or replace() methods, and can also be accessed, in your spider, from the response.meta attribute.

copy()

Return a new Request which is a copy of this Request. See also: Passing additional data to callback functions.

replace([url, method, headers, body, cookies, meta, encoding, dont_filter, callback, errback])

Return a Request object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute Request.meta is copied by default (unless a new value is given in the meta argument). See also Passing additional data to callback functions.
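
For example, a minimal sketch of deriving a new request with replace(); the URLs and meta values are hypothetical:

import scrapy

request = scrapy.Request("http://www.example.com/page?id=1",  # hypothetical URL
                         meta={'page': 1})

# Request.url is read-only, so derive a modified copy instead;
# meta is carried over because no new meta argument is given
next_request = request.replace(url="http://www.example.com/page?id=2")
assert next_request.meta == {'page': 1}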

Passing additional data to callback functions

The callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the downloaded Response object as its first argument.

Example:

def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)

In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. You can use the Request.meta attribute for that.

Here’s an example of how to pass an item using this mechanism, to populate different fields from different pages:

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item

Using errbacks to catch exceptions in request processing

The errback of a request is a function that will be called when an exception is raised while processing it.

It receives a Twisted Failure instance as first parameter and can be used to track connection establishment timeouts, DNS errors, etc.

Here's an example spider logging all errors and catching some specific errors if needed:

import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

Request.meta special keys

The Request.meta attribute can contain any arbitrary data, but there are some special keys recognized by Scrapy and its built-in extensions.

Those are:

bindaddress

The outgoing IP address to use for performing the request.

download_timeout

The amount of time (in secs) that the downloader will wait before timing out. See also: DOWNLOAD_TIMEOUT.

download_latency

The amount of time spent to fetch the response, since the request has been started, i.e. the HTTP message sent over the network. This meta key only becomes available when the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.

download_fail_on_dataloss

Whether or not to fail on broken responses. See: DOWNLOAD_FAIL_ON_DATALOSS.

max_retry_times

This meta key is used to set the maximum retry times per request. When initialized, the max_retry_times meta key takes higher precedence over the RETRY_TIMES setting.
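
As a minimal sketch, these special keys are set like any other meta entry when creating a request; the URL and values are hypothetical:

import scrapy

request = scrapy.Request(
    "http://www.example.com/slow-page",  # hypothetical URL
    meta={
        'download_timeout': 10,          # downloader waits at most 10 seconds
        'max_retry_times': 5,            # takes precedence over RETRY_TIMES
    },
)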

Request subclasses

Here is the list of built-in Request subclasses. You can also subclass it to implement your own custom functionality.

FormRequest objects (omitted)

Request usage examples

Using FormRequest to send data via HTTP POST

If you want to simulate an HTML Form POST in your spider and send a couple of key-value fields, you can return a FormRequest object (from your spider) like this:

return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]

Using FormRequest.from_response() to simulate a user login

It is usual for web sites to provide pre-populated form fields through <input type="hidden"> elements, such as session related data or authentication tokens (on login pages). When scraping, you'll want these fields to be automatically pre-populated, overriding only a couple of them, such as the user name and password. You can use the FormRequest.from_response() method for this job. Here's an example spider which uses it:

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check that the login succeeded before going on
        if b"authentication failed" in response.body:  # body is bytes
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...

Response objects

class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None])

A Response object represents an HTTP response, which is usually downloaded (by the Downloader) and fed to the Spiders for processing.

Parameters:
  • url (string) – the URL of this response
  • status (integer) – the HTTP status of the response. Defaults to 200.
  • headers (dict) – the headers of this response. The dict values can be strings (for single valued headers) or lists (for multi-valued headers).
  • body (bytes) – the response body. To access the decoded text as str (unicode in Python 2) you can use response.text from an encoding-aware Response subclass, such as TextResponse.
  • flags (list) – a list containing the initial values for the Response.flags attribute. If given, the list will be shallow copied.
  • request (Request object) – the initial value of the Response.request attribute. This represents the Request that generated this response.
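
Spiders rarely construct Response objects themselves, since the Downloader does that, but building one by hand is handy for unit-testing callbacks. A minimal sketch, assuming a TextResponse with a hypothetical URL and body:

from scrapy.http import TextResponse

# a hand-built response that can be fed to a spider callback in a test
fake_response = TextResponse(
    url="http://www.example.com/item",  # hypothetical URL
    body=b"<html><body><h1>Item</h1></body></html>",
    encoding='utf-8',
)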

url

A string containing the URL of the response.

This attribute is read-only. To change the URL of a Response use replace().

status

An integer representing the HTTP status of the response. Example: 200, 404.

headers

A dictionary-like object which contains the response headers. Values can be accessed using get() to return the first header value with the specified name, or getlist() to return all header values with the specified name. For example, this call will give you all cookies in the headers:

response.headers.getlist('Set-Cookie')

body

The body of this Response. Keep in mind that Response.body is always a bytes object. If you want the unicode version use TextResponse.text (only available in TextResponse and subclasses).

This attribute is read-only. To change the body of a Response use replace().

request

The Request object that generated this response. This attribute is assigned in the Scrapy engine, after the response and the request have passed through all Downloader Middlewares. In particular, this means that:

  • HTTP redirections will cause the original request (to the URL before redirection) to be assigned to the redirected response (with the final URL after redirection).
  • Response.request.url doesn’t always equal Response.url
  • This attribute is only available in the spider code, and in the Spider Middlewares, but not in Downloader Middlewares (although you have the Request available there by other means) and handlers of the response_downloaded signal.

meta

A shortcut to the Request.meta attribute of the Response.request object (ie. self.request.meta).

Unlike the Response.request attribute, the Response.meta attribute is propagated along redirects and retries, so you will get the original Request.meta sent from your spider.

See also

Request.meta attribute

flags

A list that contains the flags of this response. Flags are labels used for tagging Responses. For example: 'cached', 'redirected', etc. They are shown on the string representation of the Response (__str__ method), which is used by the engine for logging.

copy()

Returns a new Response which is a copy of this Response.

replace([url, status, headers, body, request, flags, cls])

Returns a Response object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute Response.meta is copied by default.

urljoin(url)

Constructs an absolute url by combining the Response’s url with a possible relative url.

This is a wrapper over urlparse.urljoin; it's merely an alias for making this call:

urlparse.urljoin(response.url, url)
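
A typical use, shown as a minimal sketch with a hypothetical spider and URL, is resolving relative hrefs inside a callback:

import scrapy

class LinksSpider(scrapy.Spider):
    name = "links_example"  # hypothetical spider
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # urljoin() resolves each (possibly relative) href against response.url
        for href in response.css("a::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)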

follow(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None)

Returns a Request instance to follow a link url. It accepts the same arguments as Request.__init__, but url can be a relative URL or a scrapy.link.Link object, not only an absolute URL.

TextResponse provides a follow() method which supports selectors in addition to absolute/relative URLs and Link objects.
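
With follow() the urljoin() step from the previous sketch becomes implicit; a minimal equivalent, again with a hypothetical spider and URL:

import scrapy

class FollowSpider(scrapy.Spider):
    name = "follow_example"  # hypothetical spider
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # follow() accepts relative URLs directly, so no urljoin() is needed
        for href in response.css("a::attr(href)").extract():
            yield response.follow(href, callback=self.parse)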

Response subclasses

Here is the list of available built-in Response subclasses. You can also subclass the Response class to implement your own functionality.

TextResponse objects (omitted)

HtmlResponse objects (omitted)

XmlResponse objects (omitted)
