一. Introduction:
When scraping data you will inevitably run into anti-crawling mechanisms, and IP bans are one of the most troublesome. Bans come in two forms:
Case 1: your requests are too frequent and too fast, so the site throttles you, reports that your access frequency is too high, and returns something other than the content you requested;
Case 2: the site blocks your IP outright and you cannot access it at all.
This post tackles the IP-ban problem. The solution is to use proxy IPs. Plenty of proxies are available online, both free and paid. Free proxies cost nothing, but they are relatively unstable and short-lived; if you want something practical and convenient, you need to build your own IP pool and maintain it regularly.
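To make "use a proxy IP" concrete: the requests library takes a `proxies` mapping of scheme to proxy URL. A tiny helper for building one (the helper name and the address are illustrative, not from any library):

```python
def build_proxies(ip, port, proxy_type="http"):
    # requests expects {"scheme": "scheme://host:port"}
    url = "{0}://{1}:{2}".format(str(proxy_type).lower(), ip, port)
    return {"http": url, "https": url}

print(build_proxies("1.2.3.4", "8080"))
# → {'http': 'http://1.2.3.4:8080', 'https': 'http://1.2.3.4:8080'}
```

You would pass this dict straight to `requests.get(url, proxies=...)`, which is exactly what the crawler below does.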
二. Building Your Own Proxy IP Pool:
1. Overall logic:
Crawl a large batch of IPs ---> store the IPs ---> fetch an IP (use it if valid; if invalid, delete it and fetch the next one) ---> use the valid IP you fetched
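The pipeline above can be sketched in miniature with an in-memory list standing in for the database (the function names and the stub liveness check are illustrative only; the real versions are built in the sections that follow):

```python
import random

def crawl():
    # stand-in for the real crawler (step 5); returns (ip, port) pairs
    return [("1.2.3.4", "8080"), ("5.6.7.8", "3128")]

pool = []  # stand-in for the MySQL table

def store(ips):
    pool.extend(ips)

def is_alive(ip_port):
    # stand-in for the validation request (step 6)
    return ip_port[0] != "5.6.7.8"

def get_one():
    # fetch a random IP; delete dead ones and retry, per the pipeline above
    while pool:
        candidate = random.choice(pool)
        if is_alive(candidate):
            return candidate
        pool.remove(candidate)
    return None

store(crawl())
print(get_one())  # → ('1.2.3.4', '8080')
```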
2. Choose a container for the IPs:
We will store the IPs in a MySQL database: create a database (ippool) and a table (project_ip).
Column descriptions:
ip: the IP address
port: the port
speed: access speed
proxy_type: protocol type, e.g. HTTP or HTTPS
ID: the id number; it is a very simple table
Table creation SQL:
/*
Navicat Premium Data Transfer
Source Server : localhost
Source Server Type : MySQL
Source Server Version : 80013
Source Host : localhost:3306
Source Schema : ippool
Target Server Type : MySQL
Target Server Version : 80013
File Encoding : 65001
*/
SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;
-- ----------------------------
-- Table structure for project_ip
-- ----------------------------
DROP TABLE IF EXISTS `project_ip`;
CREATE TABLE `project_ip` (
  `ip` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `port` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `speed` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `proxy_type` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`ID`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_0900_ai_ci ROW_FORMAT = Dynamic;
SET FOREIGN_KEY_CHECKS = 1;
3. Just in case, line up a few working proxy IPs to crawl with (advice from experience, you really need them, or else...):
4. Taking the xici proxy site as the example, analyze the site first:
What to analyze: how the data is requested, how it is loaded, how to get the relevant content, and how to extract the loaded data precisely.
Conclusion of the analysis: it is a GET request, the data we need sits directly in the rendered HTML page, and we can extract it with XPath.
5. Write the code: crawl the IPs and store them in the database:
crawlAllIp.py
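The extraction pattern can be previewed on a tiny stand-in document. The article's code uses scrapy's Selector; for a self-contained illustration the standard library's ElementTree works with the same XPath idea (the sample HTML below is made up, and far simpler than the real page):

```python
import xml.etree.ElementTree as ET

# minimal stand-in for the proxy-list page (the real table has many more columns)
doc = (
    "<html><body><table id='ip_list'>"
    "<tr><th>IP</th><th>Port</th></tr>"
    "<tr><td>1.2.3.4</td><td>8080</td></tr>"
    "</table></body></html>"
)
root = ET.fromstring(doc)
rows = root.findall(".//table[@id='ip_list']/tr")
for tr in rows[1:]:  # skip the header row, just like the real crawler
    cells = [td.text for td in tr.findall("td")]
    print(cells)  # → ['1.2.3.4', '8080']
```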
import requests
from scrapy.selector import Selector
import pymysql
import random
from time import sleep

# connect to the database
conn = pymysql.connect(host='127.0.0.1', user='root', passwd='jason!2@li&*', db='ippool', charset='utf8mb4')
cursor = conn.cursor()


def crawl_ips():
    for i in range(1, 11):  # grab the first 10 pages, roughly 1000 IPs
        sleeptime = random.choice([1, 2, 3, 4, 5, 6, 6, 7])  # space out the requests, or the site will ban your IP
        print(sleeptime)
        sleep(sleeptime)
        # build a random request header; you could also use the fake_useragent library
        headers = {
            "User-Agent": random.choice(
                ['Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6',
                 'Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5',
                 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
                 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
                 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
                 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
                 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
                 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
                 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)',
                 'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)',
                 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20',
                 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6',
                 'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
                 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
                 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1',
                 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)',
                 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
                 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)',
                 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1',
                 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.3 Mobile/14E277 Safari/603.1.30',
                 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'])}
        # pick a few working proxy IPs; you really do need your own, the ones below have most likely expired
        proxies = random.choice([
            {"http": "http://36.248.133.35:9999"},
            {"http": "http://125.123.120.109:9999"},
            {"http": "http://125.123.126.249:9999"},
        ])
        print(proxies)
        res = requests.get('https://www.xicidaili.com/nn/' + str(i), proxies=proxies, headers=headers)
        # print(res.text)
        selector = Selector(text=res.text)
        all_trs = selector.xpath('//table[@id="ip_list"]/tr')
        ip_list = []
        for tr in all_trs[1:]:
            spend_str = tr.xpath('./td/div[@class="bar"]/@title').extract_first()  # extract the speed
            if spend_str:
                speed = float(spend_str.split('秒')[0])
                all_text = tr.xpath('./td/text()').extract()
                print(all_text)
                ip = all_text[0]
                port = all_text[1]
                proxy_type = all_text[5]
                ip_list.append((ip, port, speed, proxy_type))
        for ip_info in ip_list:
            print(ip_info)
            # parameterized query: safer than string formatting against malformed/hostile values
            insert_sql = """insert into project_ip(ip,port,speed,proxy_type) VALUES(%s,%s,%s,%s)"""
            cursor.execute(insert_sql, (ip_info[0], ip_info[1], ip_info[2], ip_info[3]))
            conn.commit()


if __name__ == "__main__":
    crawl_ips()
    cursor.close()
    conn.close()
That's it for crawling; the IPs are now in the database.
6. Fetch one valid IP:
The idea: randomly pull one IP from the database and use it to request https://www.baidu.com/. If an exception is raised, the IP is invalid, so call the delete_ip function to remove it. If there is no exception, check the returned status code: if it is in the 2xx range the IP is good; otherwise call delete_ip to remove it. Finally, return the valid IP we obtained.
getOneIp.py
import requests
import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', passwd='jason!2@li&*', db='ippool', charset='utf8mb4')
cursor = conn.cursor()


class GetIP(object):
    def delete_ip(self, ip):
        # remove an invalid ip from the database
        delete_sql = """delete from project_ip where ip=%s"""
        cursor.execute(delete_sql, (ip,))
        conn.commit()
        return True

    def judge_ip(self, ip, port, proxy_type):
        # check whether an ip is usable
        http_url = 'https://www.baidu.com/'
        proxy_url = '{0}://{1}:{2}'.format(str(proxy_type).lower(), ip, port)
        try:
            proxy_dict = {
                'http': proxy_url,
            }
            response = requests.get(http_url, proxies=proxy_dict)
        except Exception:
            print("ip raised an exception")
            # delete the ip once it misbehaves
            self.delete_ip(ip)
            return False
        else:
            code = response.status_code
            if 200 <= code < 300:
                print('effective ip')
                return True
            else:
                print('invalid')
                self.delete_ip(ip)
                return False

    def get_random_ip(self):
        # randomly fetch one usable ip from the database
        random_sql = """SELECT ip,port,proxy_type FROM project_ip ORDER BY RAND() LIMIT 1"""
        cursor.execute(random_sql)
        for ip_info in cursor.fetchall():
            ip = ip_info[0]
            port = ip_info[1]
            proxy_type = ip_info[2]
            judge_re = self.judge_ip(ip, port, proxy_type)
            print(ip, port)
            if judge_re:  # the ip passed validation
                return "{0}://{1}:{2}".format(proxy_type, ip, port)
            else:
                return self.get_random_ip()


if __name__ == "__main__":
    ip = GetIP().get_random_ip()
    print(ip)
7. Once the IPs are in the database, if you need to call this multiple times, just fetch a few more valid IPs:
Last_Get_One_Effective_Ip.py
from getOneIp import *

if __name__ == "__main__":
    get_ip = GetIP()
    effectiveIp = get_ip.get_random_ip()
    print(effectiveIp)
For long-term use, deploy the code and schedule regular crawling and validation to keep the pool's IPs fresh.
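A scheduled sweep could look like the sketch below, with an in-memory list standing in for the MySQL table and a stub liveness check standing in for judge_ip (all names here are illustrative; a real deployment would use cron or a long-running loop with an interval of minutes or hours):

```python
import time

def revalidate(pool, is_alive):
    # keep only proxies that still respond; stand-in for re-running judge_ip over the table
    return [p for p in pool if is_alive(p)]

alive = {"1.2.3.4:8080"}           # pretend-liveness set; the real check issues an HTTP request
pool = ["1.2.3.4:8080", "5.6.7.8:3128"]

for _ in range(2):                 # deployed, this would be `while True` or a cron job
    pool = revalidate(pool, lambda p: p in alive)
    time.sleep(0.1)                # real sweeps would wait minutes or hours between runs

print(pool)  # → ['1.2.3.4:8080']
```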