Maintaining a Crawler Proxy IP Pool: Collecting and Validating

Task Analysis

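The free proxies we crawl come from https://www.kuaidaili.com. We use `requests` to collect the IP addresses and ports and combine each `IP` and `PORT` into the proxy format that `requests` expects. We then use `requests` to fetch http://ipcheck.chinahosting.tk/ through that proxy and check whether the returned string equals the proxy IP. If it does, the proxy works; if not, the proxy is invalid.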
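
The proxy format and the validity check described above can be sketched as two small helpers. The names `build_proxies` and `is_proxy_working` are hypothetical (they do not appear in the original code), and the check assumes the service replies with nothing but the caller's IP address as plain text.

```python
def build_proxies(ip, port):
    """Combine an IP and a port into the `proxies` mapping that requests expects."""
    proxy_url = "http://{}:{}".format(ip.strip(), str(port).strip())
    # Only the "http" scheme is mapped because the check URL is plain HTTP.
    return {"http": proxy_url}


def is_proxy_working(echoed_body, ip):
    """Return True when the check service echoed the proxy's own IP.

    Assumption: the service answers with the caller's IP as plain text.
    """
    if isinstance(echoed_body, bytes):
        echoed_body = echoed_body.decode("utf-8", errors="replace")
    return echoed_body.strip() == ip.strip()


proxies = build_proxies("1.2.3.4", "8080")
print(proxies)
```

The resulting mapping would be passed as `requests.get(check_url, proxies=proxies, timeout=5)`, and the response body handed to `is_proxy_working` together with the proxy's IP.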

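Data collection is a basic skill by now, so you can go straight to the code; the comments should be clear enough. If you are new to this, read the article "Scraping WordPress and Automatically Publishing Articles" first. Once you understand that article, you will be able to crawl most websites out there.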

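The site http://ipcheck.chinahosting.tk/ is one I set up myself to validate IPs; for details, see the article "Using Shared Hosting to Build a Service That Checks Whether a Crawler Proxy IP Works". If you plan to use this approach yourself, it is best to set up your own copy. It takes a little over 10 minutes and only a few mouse clicks.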
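
The real service runs on shared hosting and its implementation is covered in the other article. Purely to illustrate the behaviour the validation code relies on, here is a minimal Python 3 sketch of an "echo the caller's IP" endpoint built on the standard library. The handler name `EchoIPHandler` and the helper `fetch_echoed_ip` are hypothetical, and the plain-text response format is an assumption based on how the result is compared with the proxy IP.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoIPHandler(BaseHTTPRequestHandler):
    """Reply to every GET with the address the connection came from."""

    def do_GET(self):
        body = self.client_address[0].encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence the default per-request logging.
        pass


def fetch_echoed_ip():
    """Start the echo server on a free local port, query it once, return the body."""
    server = HTTPServer(("127.0.0.1", 0), EchoIPHandler)
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    try:
        url = "http://127.0.0.1:{}/".format(server.server_address[1])
        # An empty ProxyHandler ignores any proxy configured in the environment.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as resp:
            return resp.read().decode("ascii")
    finally:
        server.shutdown()
        server.server_close()
```

When a request arrives through a working proxy, the address the server sees is the proxy's, which is why comparing the response with the proxy IP serves as a validity test.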

Code Implementation

# First, import the required packages
import gevent.monkey
gevent.monkey.patch_socket()  # patch sockets before requests is imported
import gevent
import requests
import time
from fake_useragent import UserAgent
from bs4 import BeautifulSoup

# Define the GetProxy class
class GetProxy(object):
    # Initialize shared state
    def __init__(self):
        self.ua = UserAgent()
        self.check_url = 'http://ipcheck.chinahosting.tk/'
        self.threads = []  # greenlets spawned for validation
        self.count = 0     # number of proxies that passed validation

    # Request a URL and return the response body
    def download_page(self, url):
        headers = {"User-Agent": self.ua.random}
        # Send the random User-Agent along with the request
        response = requests.get(url, headers=headers)
        print(response.status_code)
        return response.content

    # Crawl the free-proxy listing pages and extract each IP/port pair
    def crawl_kuaidaili(self):
        for page in range(1, 50):
            url = 'https://www.kuaidaili.com/free/inha/' + str(page)
            response = self.download_page(url)
            soup = BeautifulSoup(response, "html.parser")
            all_tr = soup.find_all('tr')
            for tr in all_tr:
                ip = tr.find('td', attrs={"data-title": "IP"})
                port = tr.find('td', attrs={"data-title": "PORT"})
                # Rows without IP/PORT cells (such as the header row) are skipped
                if ip is None or port is None:
                    continue
                # Validate each proxy in its own greenlet
                self.threads.append(gevent.spawn(
                    self.valid_check, ip.get_text().strip(), port.get_text().strip()))
            # Pause between pages to avoid hammering the site
            time.sleep(1)

    # Check whether a proxy works
    def valid_check(self, ip, port):
        proxyip = "http://" + ip + ":" + port
        proxy = {"http": proxyip}
        try:
            response = requests.get(self.check_url, proxies=proxy, timeout=5)
            # The check service returns the caller's IP; a match means the proxy works
            if response.text.strip() == ip:
                print(ip)
                self.count = self.count + 1
        except Exception:
            # Timeouts and connection errors mean the proxy is unusable
            pass

    # Start the crawler and wait for all validation tasks to finish
    def start(self):
        self.crawl_kuaidaili()
        gevent.joinall(self.threads)
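
The class above only prints the working proxies and counts them. To feed an actual pool, the valid entries need to be returned. The following is a minimal sketch of that idea, not the author's method: it swaps the gevent greenlets for a standard-library thread pool so that it runs without extra dependencies, and it takes the checker as a parameter. `collect_valid`, `check`, and `fake_check` are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_valid(candidates, check, max_workers=20):
    """Run check(ip, port) for every candidate concurrently and keep the passes."""
    def safe_check(candidate):
        ip, port = candidate
        try:
            return bool(check(ip, port))
        except Exception:
            # Any failure (timeout, refused connection, ...) counts as invalid.
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with candidates.
        results = list(pool.map(safe_check, candidates))
    return [c for c, ok in zip(candidates, results) if ok]


# Stand-in checker for demonstration only; a real one would send a request
# through the proxy and compare the response with the proxy IP.
def fake_check(ip, port):
    if port == "0":
        raise IOError("connection refused")
    return port == "8080"


candidates = [("1.1.1.1", "8080"), ("2.2.2.2", "3128"), ("3.3.3.3", "0")]
print(collect_valid(candidates, fake_check))  # [('1.1.1.1', '8080')]
```

Passing the checker in as an argument keeps the network call separate from the bookkeeping, so the collection logic can be exercised without touching any real proxy.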
