Scrapy middleware (Part 1)

Middleware is a core concept in Scrapy. With middleware, you can modify a request before the crawler sends it, or modify the response before it is handed back to the crawler, customizing the crawler to adapt to different situations.

"Middleware" and the Chinese name mentioned in the previous section "middleman" only one word. They do indeed very similar. Middleware and intermediaries can hijack data in the middle, make a few changes and then pass the data out. The difference is that middleware developers added to the list of active components and passive intermediary, usually maliciously added to the list of links. Middleware is mainly used to aid in the development, but many intermediaries and was used to steal data, and even forgery attacks.

There are two types of middleware in Scrapy: downloader middleware (Downloader Middleware) and spider middleware (Spider Middleware).

This first part focuses on downloader middleware.

Downloader middleware

The official Scrapy documentation describes downloader middleware as follows:

The downloader middleware is a framework of hooks into Scrapy's request/response processing. It is a light, low-level system for globally altering Scrapy's requests and responses.

That description sounds convoluted, but in plain language it means: swapping proxy IPs, swapping cookies, swapping the User-Agent, and retrying requests automatically.

Without middleware, the crawler's workflow is as shown in the figure below.

[Figure: crawler workflow without middleware]

With middleware, the crawler's workflow becomes the one shown below.

[Figure: crawler workflow with middleware]

Developing a proxy middleware

In crawler development, switching proxy IPs is a very common requirement, and sometimes every request needs to use a randomly chosen proxy IP.

A middleware is itself just a Python class. As long as every crawler request "passes through" this class, it can assign a new proxy IP to the request before the crawler visits the website, which gives us dynamic proxy switching.

After you create a Scrapy project, there is a middlewares.py file in the project folder. Opening it shows content like the figure below.

[Figure: the default contents of middlewares.py]

Scrapy names this file middlewares.py automatically; the trailing "s" makes it plural, indicating that the file can hold several middleware classes. The middleware Scrapy generates automatically is a spider middleware, which will be explained in the third article of this series. For now, let's create a middleware that switches the proxy IP automatically.

Add the following code to middlewares.py:

class ProxyMiddleware(object):

    def process_request(self, request, spider):
        # pick a random proxy from the PROXIES list in settings.py
        proxy = random.choice(settings['PROXIES'])
        # setting request.meta['proxy'] routes this request through that proxy
        request.meta['proxy'] = proxy


To change the proxy of a request, you add a key named proxy to the request's meta dictionary, with the proxy IP as its value.

Since the code uses random and settings, you need to import them at the top of middlewares.py:

import random
from scrapy.conf import settings
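Note that scrapy.conf has since been deprecated and removed in newer Scrapy releases. If you are on a recent version, a minimal equivalent sketch reads the same PROXIES setting through the from_crawler hook instead:

import random

class ProxyMiddleware(object):
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this when building the middleware; read PROXIES from the settings here
        return cls(crawler.settings.getlist('PROXIES'))

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.proxies)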

Downloader middleware has a method called process_request(); the code inside it is executed before the crawler requests each page.

Open settings.py and add a few proxy IPs first:

PROXIES = ['https://114.217.243.25:8118',
           'https://125.37.175.233:8118',
           'http://1.85.116.218:8118']

Note that proxy IPs have a scheme: you need to check whether a given proxy is an HTTP proxy or an HTTPS proxy and write it accordingly. If the scheme is wrong, the target page will be unreachable.

Activating the middleware

Once the middleware is written, it has to be activated in settings.py. Find the following commented-out block in settings.py:

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'AdvanceSpider.middlewares.MyCustomDownloaderMiddleware': 543,
#}

Uncomment it and change it to reference ProxyMiddleware:

DOWNLOADER_MIDDLEWARES = {
  'AdvanceSpider.middlewares.ProxyMiddleware': 543,
}

This is actually a dictionary: each key is the dotted path to a middleware class, and the number is the order of that middleware. Because middleware runs in sequence, the order is crucial whenever one middleware depends on another.

How do you decide what the number should be? The simplest way is to start at 543 and add one for each new middleware; this generally causes no problems. If you want to set the order more precisely, you need to know the order of Scrapy's built-in middleware, shown in the figure below.

[Figure: order of Scrapy's built-in downloader middleware]
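For reference, the built-in downloader middleware order (DOWNLOADER_MIDDLEWARES_BASE) looks roughly like this; the exact classes and numbers vary slightly between Scrapy versions, so check your own installation:

# approximate defaults; see your Scrapy version's DOWNLOADER_MIDDLEWARES_BASE
DOWNLOADER_MIDDLEWARES_BASE = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}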

The smaller the number, the earlier the middleware runs. For example, the first built-in Scrapy middleware is RobotsTxtMiddleware. Its job is to check the ROBOTSTXT_OBEY option in settings.py, which is either True or False. If it is True, Scrapy promises to obey the robots.txt protocol: before each visit it checks whether the URL is allowed to be crawled, and if not, the request is cancelled and none of the subsequent steps for that request are carried out.

Middleware written by the developer is inserted into this built-in sequence. When the crawler runs, all the middleware executes in order from 100 to 900, until every middleware has run or some middleware cancels the request.

Scrapy actually ships with its own UA middleware (UserAgentMiddleware), proxy middleware (HttpProxyMiddleware) and retry middleware (RetryMiddleware). So, "in principle", if you develop your own versions of these three, you should disable the built-in ones. To disable a built-in middleware, set its order to None in settings.py:

DOWNLOADER_MIDDLEWARES = {
  'AdvanceSpider.middlewares.ProxyMiddleware': 543,
  # note: in Scrapy 1.0+ these built-in classes live under scrapy.downloadermiddlewares.* instead
  'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
  'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': None
}

Why say "in principle" should disable it? First check comes Scrapy agent middleware source code, as shown below:

[Figure: source code of the built-in HttpProxyMiddleware]
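The screenshot is not reproduced here, but the key logic of its process_request is roughly the following (paraphrased; the exact code differs between Scrapy versions):

def process_request(self, request, spider):
    # ignore the request if a proxy has already been set on it
    if 'proxy' in request.meta:
        return
    # ... otherwise fall back to proxies configured in the environment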

As the source shows, if Scrapy finds that the request already has a proxy set, this middleware does nothing and simply returns. So even though the built-in proxy middleware has order 750, larger than our custom middleware's 543, it will not overwrite the proxy the developer has already set. In other words, it does no harm to leave the built-in proxy middleware enabled.

The complete section of settings.py that activates the custom middleware is shown below.

[Figure: DOWNLOADER_MIDDLEWARES section of settings.py with the custom middleware activated]

With this configured, the crawler sets a random proxy before every request. To test the proxy middleware, you can use the following practice page:

http://exercise.kingname.info/exercise_middleware_ip

This page returns the IP address of the visitor. Opening it directly in a browser gives a result like the figure below.

[Figure: the practice page showing the visitor's IP address]

The practice page also supports paging: append "/<page number>" to the URL. For example, the URL for page 100 is:

http://exercise.kingname.info/exercise_middleware_ip/100
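A minimal test spider for this practice site might look like the following sketch (the spider name and page range are arbitrary):

import scrapy

class MiddlewareTestSpider(scrapy.Spider):
    # hypothetical spider used only to exercise the proxy middleware
    name = 'middleware_test'
    start_urls = ['http://exercise.kingname.info/exercise_middleware_ip/{}'.format(page)
                  for page in range(1, 11)]

    def parse(self, response):
        # the practice page simply returns the IP address the request came from
        self.logger.info('page %s returned: %s', response.url, response.text.strip())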

With the proxy middleware switching the proxy for every request, the run result is shown in the figure below.

[Figure: run result with the proxy middleware changing the proxy for every request]

The list of available proxies for the proxy middleware does not have to live in settings.py; it can also be kept in a database or in Redis. A workable crawler system that switches proxies automatically should have the following three capabilities.

  1. A small crawler, ProxySpider, that crawls free proxies from the major proxy sites, verifies them, and saves the usable proxy IPs to a database.
  2. In ProxyMiddleware's process_request, randomly pick one proxy IP from the database for each request (a sketch follows this list).
  3. Periodically re-verify the proxies in the database and delete the invalid ones promptly. Since free proxies expire very easily, if you have some budget it is better to buy proxy service from a professional provider, which is fast and stable.
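A minimal sketch of point 2, assuming the verified proxies are kept in a Redis set named 'proxies' (the key name and the choice of Redis as storage are assumptions):

import random
import redis

class RedisProxyMiddleware(object):
    def __init__(self):
        self.client = redis.StrictRedis()

    def process_request(self, request, spider):
        # pick one random verified proxy from the Redis set for this request
        proxy = self.client.srandmember('proxies')
        if proxy:
            request.meta['proxy'] = proxy.decode()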

Developing a UA middleware

Developing a UA middleware is almost identical to developing the proxy middleware: it randomly picks an entry from the UA list configured in settings.py and puts it into the request headers. The code is as follows:

class UAMiddleware(object):

    def process_request(self, request, spider):
        # pick a random User-Agent from the USER_AGENT_LIST in settings.py
        ua = random.choice(settings['USER_AGENT_LIST'])
        request.headers['User-Agent'] = ua

Unlike proxy IPs, UAs never expire, so once you collect a few dozen of them you can use them indefinitely. Some common UAs:

USER_AGENT_LIST = [
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
  "Dalvik/1.6.0 (Linux; U; Android 4.2.1; 2013022 MIUI/JHACNBL30.0)",
  "Mozilla/5.0 (Linux; U; Android 4.4.2; zh-cn; HUAWEI MT7-TL00 Build/HuaweiMT7-TL00) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
  "AndroidDownloadManager",
  "Apache-HttpClient/UNAVAILABLE (java 1.4)",
  "Dalvik/1.6.0 (Linux; U; Android 4.3; SM-N7508V Build/JLS36C)",
  "Android50-AndroidPhone-8000-76-0-Statistics-wifi",
  "Dalvik/1.6.0 (Linux; U; Android 4.4.4; MI 3 MIUI/V7.2.1.0.KXCCNDA)",
  "Dalvik/1.6.0 (Linux; U; Android 4.4.2; Lenovo A3800-d Build/LenovoA3800-d)",
  "Lite 1.0 ( http://litesuits.com )",
  "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727)",
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
  "Mozilla/5.0 (Linux; U; Android 4.1.1; zh-cn; HTC T528t Build/JRO03H) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30; 360browser(securitypay,securityinstalled); 360(android,uppayplugin); 360 Aphone Browser (2.0.4)",
]
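Activating it works the same way as for the proxy middleware; assuming the project is still named AdvanceSpider as above, the DOWNLOADER_MIDDLEWARES entry might look like this:

DOWNLOADER_MIDDLEWARES = {
  'AdvanceSpider.middlewares.ProxyMiddleware': 543,
  'AdvanceSpider.middlewares.UAMiddleware': 544,
  'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}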

After the UA list is configured and the middleware is activated in the downloader middleware section of settings.py, use the UA practice page to verify that the UA changes on every request. The practice page is:

http://exercise.kingname.info/exercise_middleware_ua

Like the proxy practice page, the UA practice page can be paged without limit.

The run result is shown in the figure below.

[Figure: run result with the UA middleware changing the User-Agent for every request]

Developing a Cookies middleware

For websites that require login, cookies can be used to keep the crawler logged in. If you write a separate small program that keeps logging into the site with different accounts through Selenium, you can collect many different sets of cookies. Since cookies are essentially just text, that text can be stored in Redis. Then, whenever the Scrapy crawler requests a page, it reads a set of cookies from Redis and attaches them to the request, so the crawler stays logged in the whole time.

Take the following practice page as an example:

http://exercise.kingname.info/exercise_login_success

If you visit it with Scrapy directly, what you get back is the source code of the login page, as shown in the figure below.

[Figure: the login page source returned when visiting without cookies]

Now, with a middleware, we can print the content that is shown only after logging in, without changing a single line of the code in loginSpider.py.

First, write a small program that logs into this page with Selenium and saves the cookies returned by the site to Redis. The program is shown below.

[Figure: Selenium login helper that saves cookies to Redis]

This code uses Selenium with ChromeDriver to fill in the username and password and log into the practice page, then serializes the cookies obtained after login into a JSON string and saves it to Redis.
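The screenshot is not reproduced here; a minimal sketch of such a helper might look like the following (the login URL, element locators and credentials are assumptions for illustration):

import json
import time

import redis
from selenium import webdriver
from selenium.webdriver.common.by import By

client = redis.StrictRedis()
driver = webdriver.Chrome()

driver.get('http://exercise.kingname.info/exercise_login')           # assumed login URL
driver.find_element(By.NAME, 'username').send_keys('your_account')   # assumed field name
driver.find_element(By.NAME, 'password').send_keys('your_password')  # assumed field name
driver.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()  # assumed submit button
time.sleep(3)  # wait for the login to complete

# Selenium returns cookies as a list of dicts; keep only name -> value pairs
cookies = {c['name']: c['value'] for c in driver.get_cookies()}
client.lpush('cookies', json.dumps(cookies))  # the middleware below pops from this list
driver.quit()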

Next, write the middleware that reads the cookies from Redis and hands them to the Scrapy request:

import json
import redis

class LoginMiddleware(object):
    def __init__(self):
        self.client = redis.StrictRedis()

    def process_request(self, request, spider):
        if spider.name == 'loginSpider':
            # pop one set of cookies (stored as a JSON string) from the Redis list
            cookies = json.loads(self.client.lpop('cookies').decode())
            request.cookies = cookies

Once this middleware is in place, the crawler code does not need any changes at all, yet it successfully receives the HTML that is only visible after logging in, as shown in the figure below.

[Figure: the logged-in page content retrieved through the cookies middleware]

If you have 100 accounts on a site, write a separate program that keeps logging in with Selenium (driving ChromeDriver or PhantomJS) to obtain cookies and stores them in Redis. The crawler then reads a fresh set of cookies from Redis for every new request, which greatly reduces the risk of being detected or blocked by the site.

This approach applies not only to logging in; it also works for handling CAPTCHAs.

This article stops here. In the next one, we will cover how to integrate Selenium into downloader middleware, retry requests, and handle exceptions.

Author: Qingnan
Link: https://juejin.im/post/5bf20ac551882578cc546841
Source: Juejin
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please cite the source.
