Scrapy from Entry to Abandon 5 -- Using Middleware

Use of scrapy middleware


Learning objectives:
  1. How to use middleware in scrapy to set a random User-Agent
  2. How to use a proxy ip in scrapy
  3. How to use scrapy together with selenium

1. The classification and function of scrapy middleware

1.1 Classification of scrapy middleware

Depending on where it sits in the scrapy data flow, middleware falls into two categories:

  1. Downloader middleware
  2. Spider middleware (crawler middleware)
1.2 The role of scrapy middleware: preprocessing request and response objects
  1. Replace and process headers and cookies
  2. Use a proxy ip, etc.
  3. Customize the request

By default, both kinds of middleware are defined in the project's middlewares.py file.

Spider middleware is used in the same way as downloader middleware and their capabilities largely overlap, so in practice the downloader middleware is usually the one that gets used.

2. How to use downloader middleware

Next, we modify and improve the Tencent recruitment crawler and learn how to use middleware, taking downloader middleware as the example.
Writing a downloader middleware is just like writing a pipeline: define a class, then enable it in settings.py.

Default methods of a downloader middleware:

  • process_request(self, request, spider):

    1. This method is called for every request that passes through the downloader middleware.
    2. Return None (having no return statement also returns None): the request continues on to the process_request methods of the remaining downloader middlewares and finally to the downloader.
    3. Return a Response object: the request is not sent; the response is returned to the engine.
    4. Return a Request object: the request object is handed back to the scheduler through the engine; the remaining process_request methods are not called.
  • process_response(self, request, response, spider):

    1. Called when the downloader has completed the HTTP request and is passing the response back to the engine.
    2. Return a Response: it is passed by the engine to the spider, or to the process_response methods of other downloader middlewares with lower weight values.
    3. Return a Request object: it is handed by the engine to the scheduler to be requested again; the remaining process_response methods are not called.
  • Enable the middleware in settings.py; the smaller the weight value, the earlier it is executed. A minimal skeleton of these two methods is sketched below.
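For reference, here is a minimal, hypothetical skeleton of a downloader middleware showing the two hook methods and the return values described above (the class name is made up for illustration):

class ExampleDownloaderMiddleware:
    def process_request(self, request, spider):
        # called for every request that passes through this middleware
        # return None       -> processing continues with the remaining middlewares / the downloader
        # return a Response -> the request is not sent; the response goes back to the engine
        # return a Request  -> the new request is handed back to the scheduler
        return None

    def process_response(self, request, response, spider):
        # called when the downloader has produced a response
        # must return a Response (or a Request to re-schedule the request)
        return response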

3. Define a downloader middleware that sets a random User-Agent

3.1 Improve the code in middlewares.py

import random
from Tencent.settings import USER_AGENTS_LIST # note the import path; ignore pycharm's error hint

class UserAgentMiddleware(object):
    def process_request(self, request, spider):
        user_agent = random.choice(USER_AGENTS_LIST)
        request.headers['User-Agent'] = user_agent
        # no return statement (returns None)

class CheckUA:
    def process_response(self,request,response,spider):
        print(request.headers['User-Agent'])
        return response # must not be omitted!

3.2 Enable the custom downloader middleware in settings.py; the configuration works the same way as for pipelines

DOWNLOADER_MIDDLEWARES = {
   'Tencent.middlewares.UserAgentMiddleware': 543, # 543 is the weight value
   'Tencent.middlewares.CheckUA': 600, # the middleware with weight 543 runs first, then the one with weight 600
}

3.3 Add the UA list in settings.py

USER_AGENTS_LIST = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5"
]

Run the crawler and observe the output.
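If you want to double-check the rotation outside of the Tencent project, a quick, hypothetical throwaway spider against httpbin.org (not part of the original project) can echo back the User-Agent that was actually sent:

import json
import scrapy

class UACheckSpider(scrapy.Spider):
    # hypothetical helper spider, only for verifying the random-UA middleware
    name = 'ua_check'
    start_urls = ['https://httpbin.org/headers']

    def parse(self, response):
        # httpbin echoes the request headers back as JSON
        headers = json.loads(response.text)['headers']
        self.logger.info('UA actually sent: %s', headers.get('User-Agent'))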

4. Use of proxy ip

4.1 Analysis

  1. Where to add the proxy: the proxy field in request.meta
  2. Get a proxy ip and assign it to request.meta['proxy']
    • pick a proxy ip at random from a proxy pool
    • request a proxy ip from the web API of a proxy ip provider

4.2 Concrete implementation

Using a free proxy ip:

class ProxyMiddleware(object):
    def process_request(self,request,spider):
        # proxies can be defined in settings.py, or come from a proxy-ip web API
        # proxy = random.choice(proxies)

        # free proxies expire; if you get a "111 connection refused" error, find another proxy ip and try again
        proxy = 'https://1.71.188.37:3128'

        request.meta['proxy'] = proxy
        return None # the return can also be omitted
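As the comment hints, the proxy can also be chosen at random from a list. Below is a minimal sketch, assuming a hypothetical PROXY_LIST entry in settings.py (not part of the original project):

import random

class RandomProxyMiddleware(object):
    # hypothetical variant: pick a random proxy from a PROXY_LIST defined in settings.py
    def process_request(self, request, spider):
        proxy_list = spider.settings.getlist('PROXY_LIST')  # e.g. ['https://ip1:port', 'https://ip2:port']
        if proxy_list:
            request.meta['proxy'] = random.choice(proxy_list)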

Using a paid proxy ip:

# code for paying users (using a proxy ip provided by abuyun)
import base64

# proxy tunnel authentication info, obtained when signing up on the provider's website
proxyServer = 'http://proxy.abuyun.com:9010' # address of the paid proxy server, here abuyun
proxyUser = 'username'   # your account name from the provider
proxyPass = 'password'   # your account password
proxyAuth = "Basic " + base64.b64encode((proxyUser + ":" + proxyPass).encode()).decode()

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # set the proxy
        request.meta["proxy"] = proxyServer
        # set the authentication header
        request.headers["Proxy-Authorization"] = proxyAuth

4.3 Check whether the proxy ip is available

When using a proxy ip, you can handle failures in the process_response() method of the downloader middleware: if the proxy ip cannot be used, switch to another proxy ip.

class ProxyMiddleware(object):
    ......
    def process_response(self, request, response, spider):
        if response.status != 200:          # the status code is an integer
            request.dont_filter = True      # allow the re-sent request object to enter the queue again
            return request                  # re-schedule the request
        return response                     # a normal response must still be returned
Enable the middleware in settings.py.
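For example, a hypothetical entry following the same pattern as in section 3.2 (the project name and weight value are illustrative):

DOWNLOADER_MIDDLEWARES = {
   'myspider.middlewares.ProxyMiddleware': 543,
}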

5. Use selenium in middleware

Take github login as an example

5.1 Complete the crawler code

import scrapy

class Login4Spider(scrapy.Spider):
    name = 'login4'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/1596930226'] # send the request directly to the url that verifies the login

    def parse(self, response):
        with open('check.html', 'w') as f:
            f.write(response.body.decode())

5.2 Use selenium in middlewares.py

import time
from selenium import webdriver


def getCookies():
    # use selenium to simulate the login, then collect and return the cookies
    username = input('Enter github username: ')
    password = input('Enter github password: ')
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    driver = webdriver.Chrome('/home/worker/Desktop/driver/chromedriver',
                              chrome_options=options)
    driver.get('https://github.com/login')
    time.sleep(1)
    driver.find_element_by_xpath('//*[@id="login_field"]').send_keys(username)
    time.sleep(1)
    driver.find_element_by_xpath('//*[@id="password"]').send_keys(password)
    time.sleep(1)
    driver.find_element_by_xpath('//*[@id="login"]/form/div[3]/input[3]').click()
    time.sleep(2)
    cookies_dict = {cookie['name']: cookie['value'] for cookie in driver.get_cookies()}
    driver.quit()
    return cookies_dict

class LoginDownloaderMiddleware(object):

    def process_request(self, request, spider):
        cookies_dict = getCookies()
        print(cookies_dict)
        request.cookies = cookies_dict # replace the cookies attribute of the request object
After enabling the middleware in settings.py and running the crawler, you can see the selenium-related output in the log.
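For reference, a hypothetical settings entry for enabling it (the project name and weight value are illustrative):

DOWNLOADER_MIDDLEWARES = {
   'myspider.middlewares.LoginDownloaderMiddleware': 543,
}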

Summary

The use of middleware:

  1. Write the middleware code:
  • process_request(self, request, spider):

    1. This method is called for every request that passes through the downloader middleware.
    2. Return None (having no return statement also returns None): the request continues on to the process_request methods of the remaining downloader middlewares and finally to the downloader.
    3. Return a Response object: the request is not sent; the response is returned to the engine.
    4. Return a Request object: the request object is handed back to the scheduler through the engine; the remaining process_request methods are not called.
  • process_response(self, request, response, spider):

    1. Called when the downloader has completed the HTTP request and is passing the response back to the engine.
    2. Return a Response: it is passed by the engine to the spider, or to the process_response methods of other downloader middlewares with lower weight values.
    3. Return a Request object: it is handed by the engine to the scheduler to be requested again; the remaining process_response methods are not called.
  2. Enable the middleware in settings.py:

    DOWNLOADER_MIDDLEWARES = {'myspider.middlewares.UserAgentMiddleware': 543,}


That's the end of this post. If it helped you, you're welcome to like and follow; your likes matter a lot to me.

Origin blog.csdn.net/qq_45176548/article/details/111991268