Build a proxy IP pool
(1) IP sources
Once we understand the value of a proxy IP and its port, it is clear that crawling large amounts of data requires a certain number of usable IPs.
But where do proxy IPs come from?
- Pay a proxy provider for the corresponding service
- Build your own free IP proxy pool
A self-built IP proxy pool can cover most needs.
If you crawl professionally, it is worth purchasing a stable service from a few high-quality sites.
(2) Preliminary IP collection
First, a few free proxy IP sites:
https://www.kuaidaili.com/
http://www.66ip.cn/index.html
http://www.ip3366.net/
https://www.89ip.cn/index_1
Opening the 89ip free proxy page, we quickly find the information we need: IPs and their ports.
The page content is parsed and extracted with XPath.
The more URLs we supply, the more IPs we crawl; the same XPath can be reused to request different pages of the same site.
For a different site, however, the XPath has to be rewritten.
import requests
from lxml import etree

headers = {"User-Agent": "Mozilla/5.0"}  # pose as a browser
parser = etree.HTMLParser(encoding="utf-8")

url = 'https://www.89ip.cn/index_1.html'
html = requests.get(url=url, headers=headers).text  # .text: etree.HTML() needs a string, not a Response
tree = etree.HTML(html, parser=parser)  # load the HTML document
ip_list = tree.xpath('//div[@class="layui-form"]//tr/td[1]/text()')
post_list = tree.xpath('//div[@class="layui-form"]//tr/td[2]/text()')
This page yielded 25 IPs:
180.165.133.13 : 53281
36.137.70.178 : 7777
27.42.168.46 : 55481
47.105.91.226 : 8118
221.122.91.61 : 80
183.247.202.230 : 30001
183.154.220.72 : 9000
171.92.20.37 : 9000
171.92.21.168 : 9000
223.10.18.173 : 8118
183.247.215.218 : 30001
222.174.11.87 : 7890
183.222.217.168 : 9091
182.139.111.125 : 9000
60.211.218.78 : 53281
220.170.145.103 : 7302
183.247.199.114 : 30001
218.1.142.142 : 57114
222.64.153.165 : 9000
61.61.26.181 : 80
218.28.141.66 : 8001
223.94.85.131 : 9091
221.178.239.200 : 7302
182.139.110.124 : 9000
43.248.133.29 : 8080
A dictionary is used for storage here instead of a list in order to deduplicate:
it prevents the same IP from being written repeatedly and bloating the pool.
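The effect can be shown with a toy example (the addresses are sample rows, not from a live scrape): repeated keys collapse in a dict, while a list keeps every duplicate.

```python
# Rows as they might come back from two pages of the same proxy site
scraped = [
    ("221.122.91.61", "80"),
    ("183.154.220.72", "9000"),
    ("221.122.91.61", "80"),   # duplicate row scraped from a second page
]

as_list = [f"{ip}:{port}" for ip, port in scraped]
as_dict = dict(scraped)  # repeated keys collapse, so each IP is stored once

print(len(as_list))  # 3
print(len(as_dict))  # 2
```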
(3) Availability check
IPs scraped from free proxy sites are often of low quality, so after the initial collection we must check their availability to keep later use efficient.
Visiting http://httpbin.org/ip shows which IP the current request came from.
For example, the result of my own visit is:
{
  "origin": "223.104.40.44"
}
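That check boils down to one JSON field. The sketch below uses a canned body of the same shape so it runs offline; the commented line shows the live request, which of course needs network access.

```python
import json

# canned body with the same shape as the http://httpbin.org/ip response above
sample_body = '{"origin": "223.104.40.44"}'
current_ip = json.loads(sample_body)["origin"]
print(current_ip)  # 223.104.40.44

# the live check would be:
# import requests
# current_ip = requests.get('http://httpbin.org/ip', timeout=5).json()['origin']
```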
We then request that page through each candidate proxy IP.
Comparing the IP in the response with the proxy we passed in tells us whether the proxy worked.
def test(ip, port):
    # if the proxy works, the IP parsed from the page equals the input IP
    # True: proxy succeeded; False: proxy failed
    print('开始测试' + str(ip) + '...')
    url = 'http://httpbin.org/ip'
    proxies = {"http": f"http://{ip}:{port}",
               "https": f"http://{ip}:{port}"}
    headers = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.33'
    }
    html = getHTMLText(url=url, headers=headers, data=None, proxies=proxies)
    if html == "GET异常":
        return False
    return parse(html)[0] == ip

def test_list(ip_dic):
    ip_list = list(ip_dic.keys())
    for num in range(len(ip_list)):
        if test(ip_list[num], ip_dic[ip_list[num]]):
            print(str(ip_list[num]) + '有效')
        else:
            print(str(ip_list[num]) + '无效')
            ip_dic.pop(ip_list[num])
    return ip_dic
Randomly pick two IPs and test them:
ip_dic = {
    '101.200.127.149': '3129',
    '58.220.95.114': '10053'
}
test_list(ip_dic)
operation result:
开始测试101.200.127.149...
101.200.127.149有效
开始测试58.220.95.114...
58.220.95.114无效
(4) IP pool storage and display
Store the verified IPs and their ports locally so other crawlers can use them.
There are many storage options: MySQL, txt, Excel, and so on; I use the simplest, a plain text file.
def save_ip_text(ip_dic):
    for ip in list(ip_dic.keys()):
        with open("IP_Pool.txt", 'a', encoding='utf-8') as fd:
            fd.write(str(ip) + ",\t" + str(ip_dic[ip]) + '\n')
    print('可用IP池已保存至IP_Pool.txt')

def show_ip(ip_dic):
    # simple printout
    for ip in list(ip_dic.keys()):
        print(str(ip) + ":\t" + str(ip_dic[ip]))
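On the consuming side, another crawler can load the file back and draw a random proxy. This is a sketch, not part of the article's code: `load_pool` and `random_proxies` are illustrative names, and the file name and `",\t"` separator match what `save_ip_text` writes.

```python
import random

def load_pool(path="IP_Pool.txt"):
    # read back the "ip,\tport" lines written by save_ip_text
    pool = {}
    with open(path, encoding="utf-8") as fd:
        for line in fd:
            ip, port = line.strip().split(",\t")
            pool[ip] = port
    return pool

def random_proxies(pool):
    # build a requests-style proxies dict from one randomly chosen entry
    ip = random.choice(list(pool))
    return {"http": f"http://{ip}:{pool[ip]}",
            "https": f"http://{ip}:{pool[ip]}"}
```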
(5) Complete single-threaded IP pool implementation
The complete proxy-pool build; both the IP collection and the IP validity check run in a single thread.
import random
import time
import re
import requests
from lxml import etree

# 1. requests wrapper that fetches a page's static source code
def getHTMLText(url, data, headers, proxies, code='utf-8'):
    try:
        # headers disguise the script as a browser so it is not flagged as a bot
        r = requests.get(url=url, params=data, headers=headers, proxies=proxies)
        # t = random.randint(1, 5)  # random sleep to look less machine-like
        # time.sleep(t)
        r.raise_for_status()
        r.encoding = code
        # return the static source code, or an error marker
        return r.text
    except:
        return "GET异常"
# 2. Proxy pool
# 1
def get_kuaidaili_IP():
    # grab IPs and ports from the first three pages of kuaidaili
    print('抓取快代理网站前三页IP及其端口')
    headers = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.33'
    }
    parser = etree.HTMLParser(encoding="utf-8")
    ip_dic = {}
    for i in range(1, 4):
        url = 'https://free.kuaidaili.com/free/inha/' + str(i) + '/'
        html = getHTMLText(url=url, headers=headers, data=None, proxies=None)
        tree = etree.HTML(html, parser=parser)  # load the HTML document
        ip_list = tree.xpath('/html/body/div/div[4]/div[2]/div[2]/div[2]/table/tbody/tr/td[1]/text()')
        post_list = tree.xpath('/html/body/div/div[4]/div[2]/div[2]/div[2]/table/tbody/tr/td[2]/text()')
        dic = dict(zip(ip_list, post_list))
        ip_dic = dict(ip_dic, **dic)
    return ip_dic
# 2
def get_66ip_IP():
    # grab IPs and ports from the first three pages of 66ip
    print('抓取66免费代理网前三页IP及其端口')
    ip_dic = {}
    parser = etree.HTMLParser(encoding="utf-8")
    headers = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.33'
    }

    def obtain(url):
        html = getHTMLText(url=url, headers=headers, data=None, proxies=None)
        tree = etree.HTML(html, parser=parser)  # load the HTML document
        ip_list = tree.xpath('//*[@id="main"]/div[1]/div[2]/div[1]//tr/td[1]/text()')
        post_list = tree.xpath('//*[@id="main"]/div[1]/div[2]/div[1]//tr/td[2]/text()')
        return dict(zip(ip_list, post_list))

    url = 'http://www.66ip.cn/index.html'
    ip_dic = dict(ip_dic, **obtain(url))
    for i in range(2, 4):
        url = 'http://www.66ip.cn/' + str(i) + '.html'
        ip_dic = dict(ip_dic, **obtain(url))
    return ip_dic
# 3
def get_ip3366_IP():
    # grab IPs and ports from the first three pages of ip3366
    print('抓取3366云代理网站前三页IP及其端口')
    headers = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.33'
    }
    parser = etree.HTMLParser(encoding="utf-8")
    ip_dic = {}
    for i in range(1, 4):
        url = 'http://www.ip3366.net/free/?stype=1&page=' + str(i)
        html = getHTMLText(url=url, headers=headers, data=None, proxies=None)
        tree = etree.HTML(html, parser=parser)  # load the HTML document
        ip_list = tree.xpath('//*[@id="list"]/table/tbody/tr/td[1]/text()')
        post_list = tree.xpath('//*[@id="list"]/table/tbody/tr/td[2]/text()')
        dic = dict(zip(ip_list, post_list))
        ip_dic = dict(ip_dic, **dic)
    return ip_dic
# 4
def get_89ip_IP():
    # grab IPs and ports from the first three pages of 89ip
    print('抓取89免费代理网站前三页IP及其端口')
    headers = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.33'
    }
    parser = etree.HTMLParser(encoding="utf-8")
    ip_dic = {}
    for i in range(1, 4):
        # pages are index_1.html, index_2.html, ... (was 'index_1' + str(i), which built index_11..13)
        url = 'https://www.89ip.cn/index_' + str(i) + '.html'
        html = getHTMLText(url=url, headers=headers, data=None, proxies=None)
        tree = etree.HTML(html, parser=parser)  # load the HTML document
        ip_list = tree.xpath('//div[@class="layui-form"]//tr/td[1]/text()')
        post_list = tree.xpath('//div[@class="layui-form"]//tr/td[2]/text()')
        dic = dict(zip(ip_list, post_list))
        ip_dic = dict(ip_dic, **dic)
    return ip_dic
# 5
def get_kxdaili_IP():
    # grab the high-anonymity and anonymous pages of kxdaili
    print('抓取云代理网站高匿与普匿两页IP及其端口')
    headers = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.33'
    }
    parser = etree.HTMLParser(encoding="utf-8")
    ip_dic = {}
    for url in ('http://www.kxdaili.com/dailiip.html',
                'http://www.kxdaili.com/dailiip/2/1.html'):
        html = getHTMLText(url=url, headers=headers, data=None, proxies=None)
        tree = etree.HTML(html, parser=parser)  # load the HTML document
        ip_list = tree.xpath('//div[@class="hot-product-content"]//tr/td[1]/text()')
        post_list = tree.xpath('//div[@class="hot-product-content"]//tr/td[2]/text()')
        dic = dict(zip(ip_list, post_list))
        ip_dic = dict(ip_dic, **dic)
    return ip_dic
## 3. Testing
def parse(html):
    # use a regular expression to extract every IP address on the page
    ip_list = re.findall(
        r'(?<![\.\d])(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)(?![\.\d])',
        html)
    return ip_list

def test(ip, port):
    # if the proxy works, the IP parsed from the page equals the input IP
    # True: proxy succeeded; False: proxy failed
    print('开始测试' + str(ip) + '...')
    url = 'http://httpbin.org/ip'
    proxies = {"http": f"http://{ip}:{port}",
               "https": f"http://{ip}:{port}"}
    headers = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.33'
    }
    html = getHTMLText(url=url, headers=headers, data=None, proxies=proxies)
    if html == "GET异常":
        return False
    return parse(html)[0] == ip

def test_list(ip_dic):
    ip_list = list(ip_dic.keys())
    for num in range(len(ip_list)):
        if test(ip_list[num], ip_dic[ip_list[num]]):
            print(str(ip_list[num]) + '有效')
        else:
            print(str(ip_list[num]) + '无效')
            ip_dic.pop(ip_list[num])
    return ip_dic
## 4. Result display
def save_ip_text(ip_dic):
    for ip in list(ip_dic.keys()):
        with open("IP_Pool.txt", 'a', encoding='utf-8') as fd:
            fd.write(str(ip) + ",\t" + str(ip_dic[ip]) + '\n')
    print('可用IP池已保存至IP_Pool.txt')

def show_ip(ip_dic):
    # simple printout
    for ip in list(ip_dic.keys()):
        print(str(ip) + ":\t" + str(ip_dic[ip]))

def main():
    print('------------------------------------------------')
    print('------------------------------------------------')
    print('1.开始初步IP收集')
    ip_dic = {}
    ip_dic = dict(ip_dic, **get_kuaidaili_IP())
    ip_dic = dict(ip_dic, **get_66ip_IP())
    ip_dic = dict(ip_dic, **get_ip3366_IP())
    ip_dic = dict(ip_dic, **get_89ip_IP())
    ip_dic = dict(ip_dic, **get_kxdaili_IP())
    print('2.完成初步IP收集')
    print('抓取到共计\t' + str(len(ip_dic)) + '个IP')
    print('------------------------------------------------')
    print('------------------------------------------------')
    print('3.开始可用性测试')
    ip_dic = test_list(ip_dic)
    print('------------------------------------------------')
    print('------------------------------------------------')
    print('4.有效IP存储')
    save_ip_text(ip_dic)
    print('最终有效IP数目计为\t' + str(len(ip_dic)))

if __name__ == '__main__':
    main()
operation result:
"D:\Program Files\Python\python.exe"
------------------------------------------------
------------------------------------------------
1.开始初步IP收集
抓取快代理网站前三页IP及其端口
抓取66免费代理网前三页IP及其端口
抓取3366云代理网站前三页IP及其端口
抓取89免费代理网站前三页IP及其端口
抓取云代理网站高匿与普匿两页IP及其端口
2.完成初步IP收集
抓取到共计 100个IP
------------------------------------------------
------------------------------------------------
3.开始可用性测试
开始测试117.114.149.66...
117.114.149.66无效
开始测试122.9.101.6...
122.9.101.6无效
开始测试47.113.90.161...
47.113.90.161有效
开始测试222.74.73.202...
........
------------------------------------------------
------------------------------------------------
4.有效IP存储
可用IP池已保存至IP_Pool.txt
最终有效IP数目计为 5
Process finished with exit code 0
In the end I grabbed a dozen-odd pages of free IPs from several sites, ran the availability test, and saved the survivors to a local text file.
Other crawlers can then draw proxies at random from that file.
But it also shows how inefficient free proxies are: after screening, fewer than 10 IPs remained usable.
(6) Multi-threaded IP verification
Putting the initially collected dictionary into URLs drives the multi-threaded verification.
Compared with the single-threaded version above, it is considerably faster.
import threading
import requests
import time
import queue
import re

start = time.time()
# the dict that fills the work queue
URLs = {
    '120.220.220.95': '8085',
    '101.200.127.149': '3129',
    '183.247.199.215': '30001',
    '61.216.185.88': '60808'
}
# worker thread class
class myThread(threading.Thread):
    def __init__(self, name, q):
        threading.Thread.__init__(self)
        # thread name
        self.name = name
        # shared work queue
        self.q = q

    def run(self):
        # thread body
        print("Starting " + self.name)
        while True:
            try:
                # run the time-consuming crawl step
                crawl(self.name, self.q)
            except:
                break
        # the queue is empty: exit the thread
        print("Exiting " + self.name)
def getHTMLText(url, data, headers, proxies, code='utf-8'):
    try:
        r = requests.get(url=url, params=data, headers=headers, proxies=proxies)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except:
        return "GET异常"

def parse(html):
    # use a regular expression to extract every IP address on the page
    ip_list = re.findall(
        r'(?<![\.\d])(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)(?![\.\d])',
        html)
    return ip_list
def crawl(threadName, q):
    ip = q.get(timeout=2)
    print(threadName + '开始测试' + str(ip) + '...')
    url = 'http://httpbin.org/ip'
    proxies = {"http": f"http://{ip}:{URLs.get(ip)}",
               "https": f"http://{ip}:{URLs.get(ip)}"}
    headers = {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.33'
    }
    html = getHTMLText(url=url, headers=headers, data=None, proxies=proxies)
    if html == "GET异常":
        print(str(ip) + '无效')
        return False
    if parse(html)[0] == ip:
        print(str(ip) + '有效')
    else:
        print(str(ip) + '无效')
        URLs.pop(ip)
    return parse(html)[0] == ip
workQueue = queue.Queue(len(URLs.keys()))
for url in URLs.keys():
    workQueue.put(url)

threads = []
for i in range(1, 5):
    # create 4 worker threads
    thread = myThread("Thread-" + str(i), q=workQueue)
    # start the thread
    thread.start()
    # keep a handle to it
    threads.append(thread)

# wait for all threads to finish
for thread in threads:
    thread.join()

end = time.time()
print("Queue多线程IP验证耗时:{} s".format(end - start))
print("Exiting Main Thread")
operation result:
"D:\Program Files\Python\python.exe"
Starting Thread-1
Thread-1开始测试120.220.220.95...
Starting Thread-2
Thread-2开始测试101.200.127.149...
Starting Thread-3
Thread-3开始测试183.247.199.215...
Starting Thread-4
Thread-4开始测试61.216.185.88...
183.247.199.215无效
101.200.127.149有效
Exiting Thread-3
Exiting Thread-2
120.220.220.95有效
Exiting Thread-1
61.216.185.88无效
Exiting Thread-4
Queue多线程IP验证耗时:23.041887998580933 s
Exiting Main Thread
Process finished with exit code 0
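The hand-rolled Thread/Queue machinery can also be expressed with the standard library's concurrent.futures, which handles the queueing and joining for us. This is a sketch, not the article's code: `filter_pool` is an illustrative name, and `check_one` stands in for the `test(ip, port)` logic from section (3), passed as a parameter so the pooling itself stays independent of the network.

```python
from concurrent.futures import ThreadPoolExecutor

def filter_pool(ip_dic, check_one, workers=4):
    # check_one(ip, port) -> True if the proxy answered correctly
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map runs the checks concurrently, preserving input order
        results = list(pool.map(lambda item: (item[0], check_one(*item)),
                                ip_dic.items()))
    # keep only the entries whose check succeeded
    return {ip: ip_dic[ip] for ip, ok in results if ok}
```

In real use one would pass the network-backed `test` function as `check_one`; any callable with the same (ip, port) signature works.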