Scrapy downloader middleware framework

Introduction

  Middleware is a core concept in Scrapy. Middleware can intercept and modify data before the crawler sends a request, or after a response comes back, which lets you customize the crawler to handle different situations.

"Middleware" and the Chinese name mentioned in the previous section "middleman" only one word. They do indeed very similar. Middleware and intermediaries can hijack data in the middle, make a few changes and then pass the data out. The difference is that middleware developers added to the list of active components and passive intermediary, usually maliciously added to the list of links. Middleware is mainly used to aid in the development, but many intermediaries and was used to steal data, and even forgery attacks.

There are two types of middleware in Scrapy: downloader middleware (Downloader Middleware) and spider middleware (Spider Middleware).

The official Scrapy documentation describes downloader middleware as follows.

The downloader middleware is a framework of hooks into Scrapy's request/response processing. It's a light, low-level system for globally altering Scrapy's requests and responses.

This description sounds convoluted, but put in plain words it means: swapping proxy IPs, swapping Cookies, swapping the User-Agent, and retrying automatically.

Without middleware, the crawler's data flow is shown in the figure below.

[Figure: crawler data flow without middleware]

After middleware is added, the data flow changes as shown below.

Proxy Middleware

  In crawler development, switching proxy IPs is a very common requirement; sometimes every single request needs a randomly chosen proxy IP.

Middleware itself is just a Python class. As long as every request "passes through" this class before the crawler visits the website, the class gets a chance to hand the request a fresh proxy IP, which achieves dynamic proxy switching.

After you create a Scrapy project, there is a file named middlewares.py in the project. Opening it shows the following content:

from scrapy import signals

class MeiziDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

Note that the file Scrapy generates is named middlewares.py, with an "s" at the end: the file is meant to hold any number of middleware classes. Spider middleware will be explained in a later article; for now, let's write our first downloader middleware, one that changes the proxy IP automatically.

Add the following code to middlewares.py:

class ProxyMiddleware(object):

    def process_request(self, request, spider):
        # Pick a random proxy from the PROXIES list in settings.py
        proxy = random.choice(spider.settings['PROXIES'])
        request.meta['proxy'] = proxy

To route a request through a proxy, you add a key named proxy to the request's meta dictionary; its value is the proxy's address.
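
If a proxy requires a username and password, one common approach is to send a Proxy-Authorization header along with the request. A minimal sketch, where the proxy address and credentials are placeholders (how credentials are honored can depend on the Scrapy version and on whether the target is HTTP or HTTPS):

import base64

class AuthProxyMiddleware(object):

    def process_request(self, request, spider):
        # proxy address and credentials below are placeholders
        request.meta['proxy'] = 'http://proxy.example.com:8080'
        auth = base64.b64encode(b'user:password').decode('ascii')
        request.headers['Proxy-Authorization'] = 'Basic ' + auth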

Because ProxyMiddleware uses the random module, you need to import it at the beginning of middlewares.py (the project settings are already available through spider.settings, so no separate settings import is needed):

import random

The process_request() method of a downloader middleware is executed before every request the crawler sends.

Open settings.py and add a few proxy IPs:

PROXIES = ['https://114.217.243.25:8118',
           'https://125.37.175.233:8118',
           'http://1.85.116.218:8118']

Note that proxy IPs have a scheme: you need to check whether a given proxy is an HTTP proxy or an HTTPS proxy. Using the wrong type will make the target site unreachable.
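
To avoid mixing the two up, a middleware can pick only proxies whose scheme matches the request being sent. A minimal sketch, assuming PROXIES entries carry their own http:// or https:// prefix as above:

import random

class SchemeAwareProxyMiddleware(object):

    def process_request(self, request, spider):
        # keep only proxies whose scheme matches the request's scheme
        scheme = request.url.split('://', 1)[0]
        candidates = [p for p in spider.settings['PROXIES']
                      if p.startswith(scheme + '://')]
        if candidates:
            request.meta['proxy'] = random.choice(candidates)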

Activating Middleware

After the middleware is written, it has to be activated in settings.py. Find the following commented-out block in settings.py:

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'AdvanceSpider.middlewares.MyCustomDownloaderMiddleware': 543,
#}

Uncomment it and change it to reference ProxyMiddleware:

DOWNLOADER_MIDDLEWARES = {
  'AdvanceSpider.middlewares.ProxyMiddleware': 543,
}

This setting is a dictionary: each key is the dotted path of a middleware, and the number is that middleware's order. Because middlewares run in sequence, the order is crucial whenever one middleware depends on another having run first.

How should the number be chosen? The easiest approach is to start from 543 and increase by one for each new middleware; this usually causes no problems. To be more rigorous, you need to know the order of Scrapy's built-in downloader middlewares, shown below.

DOWNLOADER_MIDDLEWARES_BASE
{
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}

The smaller the number, the earlier the middleware runs. For example, Scrapy's first built-in middleware is RobotsTxtMiddleware. Its job is to check whether the ROBOTSTXT_OBEY setting in settings.py is True or False. If it is True, Scrapy promises to honor the Robots.txt protocol: the middleware checks whether the URL about to be visited is allowed, and if it is not, the request is cancelled on the spot and none of the subsequent steps for that request are carried out.
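
For practice targets like the ones used later in this article, you will often want to skip that check; this is a single line in settings.py:

# Do not check robots.txt before crawling (the default project template sets this to True)
ROBOTSTXT_OBEY = False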

Middlewares written by developers are inserted among Scrapy's built-in middlewares according to their order numbers. When the crawler runs, every request passes through all the middlewares in order from 100 up to 900, until the whole chain has run or some middleware cancels the request.
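
As an illustration of a middleware cancelling a request, here is a minimal sketch that drops requests to blocked hosts; BLOCKED_HOSTS is an illustrative custom setting, not something Scrapy defines:

from scrapy.exceptions import IgnoreRequest

class BlocklistMiddleware(object):

    def process_request(self, request, spider):
        # cancel the request if its URL mentions a blocked host
        for host in spider.settings.getlist('BLOCKED_HOSTS'):
            if host in request.url:
                raise IgnoreRequest('blocked host: %s' % host)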

Scrapy actually ships with its own UA middleware (UserAgentMiddleware), proxy middleware (HttpProxyMiddleware), and retry middleware (RetryMiddleware). So, "in principle", if you develop your own versions of these three middlewares, you need to disable the corresponding built-in ones. To disable a built-in middleware, set its order to None in settings.py:

DOWNLOADER_MIDDLEWARES = {
  'AdvanceSpider.middlewares.ProxyMiddleware': 543,
  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
  'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None
}

With this configured, the crawler will select a proxy at random before every request. To test that the proxy middleware works, you can use the following practice page:

http://httpbin.org/get

This page returns the IP address that the request appears to come from.
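
The response is a small JSON document describing the request httpbin received; the origin field holds the visible IP. Its shape looks roughly like this (values are illustrative):

{
  "args": {},
  "headers": {
    "Host": "httpbin.org",
    "User-Agent": "Scrapy/2.11 (+https://scrapy.org)"
  },
  "origin": "117.191.11.102",
  "url": "http://httpbin.org/get"
}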

Worked example. Free proxies can be found at sites such as http://www.goubanjia.com/.

The spider:

import scrapy
import json

class MeinvSpider(scrapy.Spider):
    name = 'meinv'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        str_info = response.body.decode()
        dic_info = json.loads(str_info)
        print(dic_info["origin"])
The proxy middleware, in middlewares.py (this version imports the proxy list directly from the project's settings module):
import random

from meizi.settings import PROXIES

class ProxyMiddleware(object):

    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        request.meta['proxy'] = proxy
        return None
And the additions to settings.py:
# Proxy pool
PROXIES = ['http://117.191.11.102:80',
           'http://117.191.11.107:80',
           'http://117.191.11.72:8080']

# Enable the proxy middleware
DOWNLOADER_MIDDLEWARES = {
    'meizi.middlewares.ProxyMiddleware': 543,
}
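
With the spider, middleware, and settings in place, run the crawler from the project directory:

scrapy crawl meinv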

The output prints the IP of whichever proxy was used for the request.

UA Middleware

Developing a UA middleware is almost identical to developing the proxy middleware: it randomly picks one UA from a list configured in settings.py and adds it to the request headers. The code is as follows:

class UAMiddleware(object):

    def process_request(self, request, spider):
        # Pick a random User-Agent from USER_AGENT_LIST in settings.py
        ua = random.choice(spider.settings['USER_AGENT_LIST'])
        request.headers['User-Agent'] = ua

User-Agents have an advantage over proxy IPs: they do not expire. Collect a few dozen once and you can use them indefinitely. Some common User-Agents are listed below:

USER_AGENT_LIST = [
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
  "Dalvik/1.6.0 (Linux; U; Android 4.2.1; 2013022 MIUI/JHACNBL30.0)",
  "Mozilla/5.0 (Linux; U; Android 4.4.2; zh-cn; HUAWEI MT7-TL00 Build/HuaweiMT7-TL00) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
  "AndroidDownloadManager",
  "Apache-HttpClient/UNAVAILABLE (java 1.4)",
  "Dalvik/1.6.0 (Linux; U; Android 4.3; SM-N7508V Build/JLS36C)",
  "Android50-AndroidPhone-8000-76-0-Statistics-wifi",
  "Dalvik/1.6.0 (Linux; U; Android 4.4.4; MI 3 MIUI/V7.2.1.0.KXCCNDA)",
  "Dalvik/1.6.0 (Linux; U; Android 4.4.2; Lenovo A3800-d Build/LenovoA3800-d)",
  "Lite 1.0 ( http://litesuits.com )",
  "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727)",
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
  "Mozilla/5.0 (Linux; U; Android 4.1.1; zh-cn; HTC T528t Build/JRO03H) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30; 360browser(securitypay,securityinstalled); 360(android,uppayplugin); 360 Aphone Browser (2.0.4)",
]

After configuring the UA list, activate the middleware in DOWNLOADER_MIDDLEWARES in settings.py, then use the practice page to verify that every request carries a different UA. The practice page address is:

http://httpbin.org/get

Worked example. The spider:

import scrapy
import json

class MeinvSpider(scrapy.Spider):
    name = 'meinv'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):

        str_info = response.body.decode()
        dic_info = json.loads(str_info)
        print(dic_info["headers"]['User-Agent'])
The UA middleware, in middlewares.py:
import random

from meizi.settings import USER_AGENT_LIST

class UAMiddleware(object):

    def process_request(self, request, spider):
        ua = random.choice(USER_AGENT_LIST)
        request.headers['User-Agent'] = ua
        return None
And the additions to settings.py:
USER_AGENT_LIST = [
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
  "Dalvik/1.6.0 (Linux; U; Android 4.2.1; 2013022 MIUI/JHACNBL30.0)",
  "Mozilla/5.0 (Linux; U; Android 4.4.2; zh-cn; HUAWEI MT7-TL00 Build/HuaweiMT7-TL00) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
  "AndroidDownloadManager",
  "Apache-HttpClient/UNAVAILABLE (java 1.4)",
  "Dalvik/1.6.0 (Linux; U; Android 4.3; SM-N7508V Build/JLS36C)",
  "Android50-AndroidPhone-8000-76-0-Statistics-wifi",
  "Dalvik/1.6.0 (Linux; U; Android 4.4.4; MI 3 MIUI/V7.2.1.0.KXCCNDA)",
  "Dalvik/1.6.0 (Linux; U; Android 4.4.2; Lenovo A3800-d Build/LenovoA3800-d)",
  "Lite 1.0 ( http://litesuits.com )",
  "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727)",
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
  "Mozilla/5.0 (Linux; U; Android 4.1.1; zh-cn; HTC T528t Build/JRO03H) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30; 360browser(securitypay,securityinstalled); 360(android,uppayplugin); 360 Aphone Browser (2.0.4)",
]

DOWNLOADER_MIDDLEWARES = {
   'meizi.middlewares.ProxyMiddleware': None,
   'meizi.middlewares.UAMiddleware': 543,
}

The output prints a different User-Agent for each request.

Cookies Middleware

For sites that require login, Cookies can keep the crawler logged in. If you write a separate small program that uses Selenium to log in to the site with different accounts again and again, you can collect many different Cookies. Since Cookies are essentially just text, they can be stored in Redis. When the Scrapy crawler is about to request a page, a middleware can fetch a set of Cookies from Redis and attach them to the request, so the crawler stays logged in. A minimal sketch of this idea follows.
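
The sketch below assumes a local Redis server and a Redis list named 'cookies_pool' that some other program (for example the Selenium login script) fills with JSON-encoded cookie dictionaries; the key name and connection parameters are illustrative:

import json
import random

import redis  # requires the redis-py package

class RandomCookiesMiddleware(object):

    def __init__(self):
        # connection parameters are illustrative
        self.client = redis.StrictRedis(host='localhost', port=6379, db=0)

    def process_request(self, request, spider):
        # pull all stored cookie sets and attach a random one;
        # the built-in CookiesMiddleware (order 700) picks them up later
        pool = self.client.lrange('cookies_pool', 0, -1)
        if pool:
            request.cookies = json.loads(random.choice(pool))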

This exercise uses the following page as an example:

http://exercise.kingname.info/exercise_login_success

If Scrapy accesses it directly, what it gets back is the source code of the login page.
