Learning Python, Day 02-03 --- Building a Distributed Search-Engine Crawler with Scrapy, Explained Simply (reposted)

Section 347 of the series "Python Distributed Search-Engine Crawlers with Scrapy, Explained Simply" - randomly replacing the browser User-Agent with a downloader middleware

Introduction to downloader middleware

Downloader middleware is a framework of hooks into Scrapy's request/response processing. It is a light, low-level system for globally altering the requests Scrapy sends and the responses it receives: a middleware sits between the outgoing Request and the incoming Response and can modify either one.
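The hook contract described above can be sketched in plain Python. The method names (process_request / process_response) follow Scrapy's convention, but the Request stand-in and the middleware below are hypothetical simplifications, so the sketch runs without Scrapy installed:

```python
# Minimal sketch of the downloader-middleware contract.
# The Request class here is a stand-in, not Scrapy's real Request.
class Request:
    def __init__(self, url):
        self.url = url
        self.headers = {}

class HeaderStampMiddleware:
    def process_request(self, request, spider):
        # Called for every outgoing request; returning None passes the
        # request on to the next middleware in the chain.
        request.headers.setdefault('X-Stamp', 'seen-by-middleware')
        return None

    def process_response(self, request, response, spider):
        # Called for every response on the way back up the chain;
        # must return a response object.
        return response

mw = HeaderStampMiddleware()
req = Request('http://example.com')
mw.process_request(req, spider=None)
print(req.headers['X-Stamp'])   # the middleware has stamped the request
```

The real UserAgentMiddleware discussed next works the same way: it hooks process_request and sets a header before the request is downloaded.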

UserAgentMiddleware, the default middleware

UserAgentMiddleware lives in Scrapy's source at scrapy/downloadermiddlewares/useragent.py and is enabled by default.

As the source shows, the default User-Agent on every Request is simply "Scrapy", which makes the crawler very easy for websites to identify and block.

We can replace the default UserAgentMiddleware with a custom middleware that randomly swaps the User-Agent browser agent in each Request's headers.

Step 1: enable middleware registration in the settings.py configuration file

DOWNLOADER_MIDDLEWARES = {}

Disable the default UserAgentMiddleware by setting it to None (or give it the largest order value so it executes last); otherwise it would apply its own User-Agent before our custom middleware gets to set one.

settings.py configuration file:

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html

DOWNLOADER_MIDDLEWARES = {              # enable/register middlewares
   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None, # disable the default UserAgentMiddleware
}

Step 2: install the browser user-agent module fake-useragent 0.1.7

fake-useragent is a module dedicated to disguising a crawler's User-Agent request header. It maintains an online library of User-Agent strings for many versions of each browser, ready for us to use.

The online browser strings for version 0.1.7 live at http://fake-useragent.herokuapp.com/browsers/0.1.7; this is where fake-useragent draws its random browser agents from.

First, install the module:

pip install fake-useragent

Usage:

#!/usr/bin/env python
# -*- coding:utf8 -*-

from fake_useragent import UserAgent  # import the browser user-agent module
ua = UserAgent()                      # instantiate the UserAgent class

ua.ie                                 # get a random IE-type user agent
# Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US);
ua.msie                               # get a random MSIE-type user agent; the lines below work the same way
# Mozilla/5.0 (compatible; MSIE 10.0; Macintosh; Intel Mac OS X 10_7_3; Trident/6.0)'
ua['Internet Explorer']
# Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; GTB7.4; InfoPath.2; SV1; .NET CLR 3.3.69573; WOW64; en-US)
ua.opera
# Opera/9.80 (X11; Linux i686; U; ru) Presto/2.8.131 Version/11.11
ua.chrome
# Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 Safari/537.2'
ua.google
# Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.13 (KHTML, like Gecko) Chrome/24.0.1290.1 Safari/537.13
ua['google chrome']
# Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11
ua.firefox
# Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1
ua.ff
# Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1
ua.safari
# Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25

# and the best one: random, weighted by real-world browser usage statistics
ua.random                               # get a random user agent across all browser types

For more usage, see https://pypi.python.org/pypi/fake-useragent/0.1.7

Step 3: write a custom middleware that globally and randomly replaces the User-Agent browser agent in every Request's headers

Define the custom middleware in the middlewares.py file:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
from fake_useragent import UserAgent    # import the browser user-agent module

class RequestsUserAgentmiddware(object):                                    # custom browser user-agent middleware
    # randomly replaces the User-Agent browser agent in each Request's headers
    def __init__(self, crawler):
        super(RequestsUserAgentmiddware, self).__init__()                   # run the parent class's __init__
        self.ua = UserAgent()                                               # instantiate the browser user-agent class
        self.ua_type = crawler.settings.get('RANDOM_UA_TYPE', 'random')     # browser type from the RANDOM_UA_TYPE setting in settings.py; defaults to 'random'

    @classmethod                                                            # decorated with @classmethod; cls receives the class itself
    def from_crawler(cls, crawler):                                         # Scrapy calls this to construct the middleware
        return cls(crawler)                                                 # pass the crawler to the class constructor

    def process_request(self, request, spider):                             # called for every outgoing request
        def get_ua():                                                       # helper returning the configured browser type from the user-agent object
            return getattr(self.ua, self.ua_type)
        request.headers.setdefault('User-Agent', get_ua())                  # add the browser user-agent to the Request's headers
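The getattr() call inside get_ua() is what lets a single RANDOM_UA_TYPE string pick out any browser attribute of the UserAgent object. A stand-in class (hypothetical, with fixed strings instead of random ones, so this runs without fake-useragent) makes the lookup visible:

```python
# Stand-in for fake_useragent's UserAgent to show the getattr() lookup
# used by get_ua(); the strings are fixed examples, not random values.
class StubUserAgent:
    chrome = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.2 Chrome/22.0.1216.0'
    firefox = 'Mozilla/5.0 (Windows NT 6.2; rv:16.0.1) Gecko/20121011 Firefox/16.0.1'

ua = StubUserAgent()
for ua_type in ('chrome', 'firefox'):   # values RANDOM_UA_TYPE could hold
    print(getattr(ua, ua_type))         # same attribute lookup as get_ua()
```

With the real UserAgent class, the same lookup returns a freshly randomized string on every access, which is why each Request ends up with a different User-Agent.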

Step 4: register our custom middleware in DOWNLOADER_MIDDLEWARES in the settings.py configuration file

Note that the default UserAgentMiddleware must be set to None so that our custom middleware takes effect.

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html

DOWNLOADER_MIDDLEWARES = {              # enable/register middlewares
   'adc.middlewares.RequestsUserAgentmiddware': 543,
   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None, # disable the default UserAgentMiddleware
}
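The custom middleware also reads a RANDOM_UA_TYPE setting that the steps above never define. A plausible settings.py entry (an assumption inferred from the crawler.settings.get('RANDOM_UA_TYPE', 'random') call in the middleware, where 'random' is the fallback) would be:

```python
# settings.py -- browser type for the custom user-agent middleware;
# any fake-useragent attribute name works, e.g. 'chrome', 'firefox',
# 'ie', or 'random' (random across all browser types, the default)
RANDOM_UA_TYPE = 'random'
```

If the setting is omitted, the middleware falls back to 'random', so adding it is only needed to pin a specific browser type.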

We can set a breakpoint and debug to verify that the middleware takes effect.

Schematic of how it works (diagram not reproduced)



Source: blog.csdn.net/u013683613/article/details/104369261