Crawler Proxy Pool Setup === Written Out of Boredom

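Setting up the proxy pool:
Basic setup of the tinyproxy proxy service
Install:
apt install tinyproxy
Configure:
vim /etc/tinyproxy.conf
Change two settings in this file. First, comment out this line so that clients other than localhost are allowed to connect: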

# Allow 127.0.0.1

Then change the default port number:

Port XXXX   (choose your own)

Restart tinyproxy:

sudo systemctl restart tinyproxy
# works on Ubuntu 16

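If you are on a cloud server, you also need to add a security group rule:
Set the port range to 1703/1703 (that is, the port configured above) and the allowed source IP to 0.0.0.0/0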
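
After opening the port in the security group, it can help to confirm from another machine that the port really accepts TCP connections before testing the proxy itself. Below is a minimal sketch using only the standard library. The function name port_is_open and the host and port in the usage comment are placeholders for illustration, not something tinyproxy provides.

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False

# Example (placeholder values): port_is_open("your.server.ip", 1703)
```

A False result usually points at the security group rule or at tinyproxy not listening on that port, rather than at the crawler.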
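Test whether tinyproxy works:
Enter scrapy shell from inside the project
(run scrapy shell from the directory that contains the project's .cfg file)
In the shell, run:
import requests
requests.get('http://httpbin.org/ip', proxies={'http': 'http://host:port'}).json()
If the result is your proxy's IP, everything is working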
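
The same check can also be run as a small script outside scrapy shell. The sketch below uses the standard library instead of requests. The helper names fetch_origin_via_proxy and origin_matches_proxy are hypothetical, and it assumes httpbin.org/ip answers with JSON containing an "origin" field, which is what the check above relies on.

```python
import json
import urllib.request
from urllib.parse import urlsplit

def fetch_origin_via_proxy(proxy_url, timeout=10):
    """Ask httpbin.org for the caller's IP, sending the request through proxy_url."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url})
    )
    with opener.open("http://httpbin.org/ip", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["origin"]

def origin_matches_proxy(origin, proxy_url):
    """True if the proxy's host appears among the IPs reported in origin."""
    proxy_host = urlsplit(proxy_url).hostname
    reported = [ip.strip() for ip in origin.split(",")]
    return proxy_host in reported

# Example (placeholder address):
# proxy = "http://203.0.113.7:1703"
# print(origin_matches_proxy(fetch_origin_via_proxy(proxy), proxy))
```

The comparison is only meaningful when the proxy URL contains an IP address rather than a domain name, since httpbin reports IPs.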
Non-distributed proxy pool setup:
In the middlewares file:

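import random
from scrapy.exceptions import NotConfigured

class RandomProxyMiddleware(object):
    def __init__(self, settings):
        # Load the proxy pool from the PROXIES setting
        self.proxies = settings.getlist('PROXIES')

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy builds the middleware through this hook.
        # HTTPCACHE_ENABLED is used as the on/off switch: skip the middleware when it is off
        if not crawler.settings.getbool('HTTPCACHE_ENABLED'):
            raise NotConfigured
        return cls(crawler.settings)

    def process_request(self, request, spider):
        # Pick one proxy at random unless the request already carries one
        if 'proxy' not in request.meta:
            request.meta['proxy'] = random.choice(self.proxies)

    def process_response(self, request, response, spider):
        # Print the proxy that was used; the response must be returned so processing continues
        print(request.meta.get('proxy'))
        return response

    def process_exception(self, request, exception, spider):
        # Returning None leaves the exception to Scrapy's default handling
        pass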
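
To see what process_request does without starting a crawl, the selection rule can be exercised on a stand-in request object. FakeRequest and assign_proxy below are illustrative stand-ins, not Scrapy classes. They only mimic the meta dict that the middleware touches, and the addresses are placeholders.

```python
import random

class FakeRequest:
    """Stand-in for scrapy.Request: only carries the meta dict."""
    def __init__(self, meta=None):
        self.meta = dict(meta or {})

def assign_proxy(request, proxies):
    """Keep an existing proxy, otherwise pick one from the pool at random."""
    if 'proxy' not in request.meta:
        request.meta['proxy'] = random.choice(proxies)
    return request

pool = ['http://10.0.0.1:1703', 'http://10.0.0.2:1703']  # placeholder addresses

# A request without a proxy gets one from the pool
plain = assign_proxy(FakeRequest(), pool)
# A request that already sets meta['proxy'] is left alone
pinned = assign_proxy(FakeRequest({'proxy': 'http://10.0.0.9:1703'}), pool)
```

Note that random.choice returns a single URL string, which is what request.meta['proxy'] expects, while random.choices returns a list.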

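# settings file configuration

DOWNLOADER_MIDDLEWARES = {
   'xpc.middlewares.RandomProxyMiddleware': 749,
}  # proxy pool; Scrapy's built-in HttpProxyMiddleware sits at 750, so this runs just before it. Enable this entry in the file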

HTTPCACHE_ENABLED = True  # enable this in the generated settings file
PROXIES = [
   'http://xxxxxxxxxxxxxx:port',
   'http://xxxxxxxxxxxxxx:port',
   'http://xxxxxxxxxxxxxx:port',
   'http://xxxxxxxxxxxxxx:port',
   'http://xxxxxxxxxxxxxx:port',
   'http://xxxxxxxxxxxxxx:port',
]
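
A malformed entry in PROXIES (for example a full-width colon left over from a Chinese input method, or a missing port) only shows up as failed requests at crawl time. Below is a small sanity check, sketched with the standard library. invalid_proxies is a hypothetical helper, and requiring an explicit port is an assumption made here, not a Scrapy rule.

```python
from urllib.parse import urlsplit

def invalid_proxies(proxies):
    """Return the entries that are not of the form scheme://host:port."""
    bad = []
    for url in proxies:
        try:
            parts = urlsplit(url)
            ok = (
                parts.scheme in ('http', 'https')
                and parts.hostname is not None
                and parts.port is not None  # raises ValueError if the port is not numeric
            )
        except ValueError:
            ok = False
        if not ok:
            bad.append(url)
    return bad

# Example: run invalid_proxies(PROXIES) once before starting the crawl
```

An empty list back means every entry at least parses as a proxy URL; it does not say whether the proxy is reachable.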

# I don't know how other people set this up. Thanks for your support!
