Scraping and Validating Proxies with Multiple Threads

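Preface

One of the most common anti-scraping measures is checking how often you send requests. If you send a large number of requests in a short time, the site blocks your account or IP for a while, whether or not you are a human. To keep a crawler working you then need proxy IPs to disguise where the requests come from. This post uses multiple threads to scrape proxy IPs and validate them.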

Analysis

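The free proxy IPs used here come from the Xici free proxy list (西刺免费代理IP).
[Figure: the target website]
Source analysis: based on the page source, I chose to extract the data directly with XPath. Compared with BeautifulSoup, XPath is not only faster but also more concise, and it needs less code.
[Figure: page source of the proxy table]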
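To make the XPath step concrete, here is a minimal sketch that runs offline. The `SAMPLE_HTML` markup and the `extract_proxies` helper are hypothetical stand-ins, not the real page or the author's code; the only thing carried over from the post is that the IP sits in the second cell of each row, and the port in the third cell is an assumption. Note that in this sample a predicate such as `@class=""` matches only the one row whose class attribute is empty, so the sketch selects every row that has `td` cells instead.

```python
from lxml import etree

# Hypothetical sample markup imitating a proxy-list table; the real page may differ.
SAMPLE_HTML = """
<table id="ip_list">
  <tr><th>Country</th><th>IP</th><th>Port</th></tr>
  <tr class="odd"><td>CN</td><td>1.2.3.4</td><td>8080</td></tr>
  <tr class=""><td>CN</td><td>5.6.7.8</td><td>3128</td></tr>
</table>
"""

def extract_proxies(page_text):
    """Return 'ip:port' strings for every data row in the table."""
    html = etree.HTML(page_text)
    proxies = []
    # tr[td] skips the header row, which only contains <th> cells.
    for row in html.xpath('//tr[td]'):
        ip = row.xpath('./td[2]/text()')
        port = row.xpath('./td[3]/text()')  # assumed port column
        if ip and port:
            proxies.append(f"{ip[0].strip()}:{port[0].strip()}")
    return proxies

print(extract_proxies(SAMPLE_HTML))
```

Selecting rows first and then reading cells relative to each row keeps the IP and port of one proxy together, which a single flat `//td[2]/text()` query cannot guarantee.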

Code

Scraping code:

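def get_info(Queue,flag):
    while True:
        try:
            url=Queue.get_nowait()  # raises queue.Empty once every page has been taken
        except queue.Empty:
            break
        txt=requests.get(url,headers=headers).text
        html=etree.HTML(txt)
        # Keeps the original row filter; td[2] is the IP column, td[3] is assumed to be the port column
        for row in html.xpath('//tr[@class=""]'):
            ip=row.xpath('./td[2]/text()')
            port=row.xpath('./td[3]/text()')
            if ip and port:
                Queue3.put([ip[0]+':'+port[0],flag])
        yz(Queue3)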

Validation code:

def yz(Queue):
    while True:
        try:
            cc=Queue.get_nowait()  # stop instead of blocking once the queue is empty
        except queue.Empty:
            break
        ip,flag=cc[0],cc[1]
        try:
            # The proxies key must match the scheme of the requested URL, otherwise the proxy is not used
            proxies={flag:'http://'+ip}
            response=requests.get(flag+'://www.baidu.com',proxies=proxies,timeout=2)
            if response.status_code ==200:
                print(flag,ip,'yes')
            else:
                print(flag,ip,'no')
        except Exception as e:
            print(flag,ip,'no',e)

Full code:

import requests
from lxml import etree
import queue
import threading

Queue1=queue.Queue(23)
Queue2=queue.Queue(18)
Queue3=queue.Queue(10000)
for i in range(1,10):
    Queue1.put("https://www.xicidaili.com/wt/%d"%i)  # put the proxy-list page URLs into a queue so the threads can share them
for i in range(1,10):
    Queue2.put("https://www.xicidaili.com/wn/%d"%i)

headers={'User-Agent': 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5',}

def yz(Queue):
    while True:
        try:
            cc=Queue.get_nowait()  # stop instead of blocking once the queue is empty
        except queue.Empty:
            break
        ip,flag=cc[0],cc[1]
        try:
            # The proxies key must match the scheme of the requested URL, otherwise the proxy is not used
            proxies={flag:'http://'+ip}
            response=requests.get(flag+'://www.baidu.com',proxies=proxies,timeout=2)
            if response.status_code ==200:
                print(flag,ip,'yes')
            else:
                print(flag,ip,'no')
        except Exception as e:
            print(flag,ip,'no',e)

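def get_info(Queue,flag):
    while True:
        try:
            url=Queue.get_nowait()  # raises queue.Empty once every page has been taken
        except queue.Empty:
            break
        txt=requests.get(url,headers=headers).text
        html=etree.HTML(txt)
        # Keeps the original row filter; td[2] is the IP column, td[3] is assumed to be the port column
        for row in html.xpath('//tr[@class=""]'):
            ip=row.xpath('./td[2]/text()')
            port=row.xpath('./td[3]/text()')
            if ip and port:
                Queue3.put([ip[0]+':'+port[0],flag])
        yz(Queue3)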


if __name__ == '__main__':
    for i in range(3):
        th=threading.Thread(target=get_info,args=[Queue1,'http'])
        th.start()
    for i in range(3):
        td=threading.Thread(target=get_info,args=[Queue2,'https'])
        td.start()
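One detail worth isolating from the full program is how worker threads know when to stop. A `queue.Queue` object is always truthy, and a plain `get()` blocks when the queue is empty, so a worker needs an explicit exit condition. The following is a minimal offline sketch of one way to do that with `get_nowait()`; `drain`, `run_workers` and the stand-in `fake_check` handler are hypothetical names, and a real run would pass a function that actually tests the proxy.

```python
import queue
import threading

def drain(task_queue, handler, results, lock):
    """Process tasks until the queue is empty, then return so the thread can exit."""
    while True:
        try:
            item = task_queue.get_nowait()  # raises queue.Empty instead of blocking
        except queue.Empty:
            return
        outcome = handler(item)
        with lock:  # protect the shared results list
            results.append((item, outcome))

def run_workers(items, handler, worker_count=3):
    """Fill a queue with items and let worker_count threads drain it."""
    task_queue = queue.Queue()
    for item in items:
        task_queue.put(item)
    results, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=drain, args=(task_queue, handler, results, lock))
        for _ in range(worker_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every worker has finished
    return results

def fake_check(address):
    """Stand-in for a real proxy check so the sketch runs without network access."""
    return address.endswith(":8080")

checked = run_workers(["1.1.1.1:8080", "2.2.2.2:3128", "3.3.3.3:8080"], fake_check)
print(sorted(checked))
```

This works because the queue is filled completely before any worker starts; if producers kept adding items while workers ran, a sentinel value or `Queue.join()` would be a better stopping signal.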

Result screenshot

[Screenshot: program output]

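Reflections and summary

The purpose of a crawler is to collect useful information. Don't scrape what you don't need; this saves time and improves efficiency.
When writing a crawler, consider common anti-scraping strategies up front. This saves the time you would otherwise spend reworking the code after running into them.
Keep your appetite in check and analyze the site first. This site lists tens of thousands of proxy IPs, and fetching all of them is clearly unwise, a mistake I also made this time. I stress analyzing the site because I noticed that some entries on the later pages were last verified back in 2016; I should have taken only the most recent ones. It is not that the old entries are necessarily useless, there is simply no need to fetch them.
Besides scraping free lists, you can also buy proxy IPs, although I have read online that purchased IPs are not very stable, while free IPs may stop working at any time. Another option is to obtain IPs automatically through an API, which has drawbacks of its own; the website of whichever API you use describes them.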
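The point about taking only recently verified entries can be sketched as a small filter. Everything specific here is an assumption for illustration: the `is_recent` helper, the 30-day cutoff, the sample rows, and in particular the `yy-mm-dd HH:MM` timestamp format, which should be adjusted to whatever the listing actually shows.

```python
from datetime import datetime, timedelta

def is_recent(verified_text, now, max_age_days=30, fmt="%y-%m-%d %H:%M"):
    """Return True if the verification timestamp is within max_age_days of now."""
    verified_at = datetime.strptime(verified_text.strip(), fmt)
    return now - verified_at <= timedelta(days=max_age_days)

# Hypothetical (address, last-verified) pairs, as they might be scraped from the table
rows = [
    ("1.2.3.4:8080", "20-02-01 10:30"),
    ("5.6.7.8:3128", "16-05-17 08:00"),
]

now = datetime(2020, 2, 10)  # fixed reference time so the example is reproducible
fresh = [address for address, verified in rows if is_recent(verified, now)]
print(fresh)
```

Filtering before validation means the slow network checks run only on entries that have a realistic chance of still working.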

Author: 稳在前
