Table of Contents
Main libraries and tools used
requests
aiohttp
lxml
Beautiful Soup
pyquery
asyncio
fake_useragent
pymongo
MongoDB
python3.7
I. Introduction
- Analyze the page code and crawl the proxy information (10 pages crawled);
- Practice using different parsing libraries to extract the proxies (IP:Port and type);
- Test the fetched proxies and filter out the unusable ones;
- Store the proxies that pass the test into MongoDB.
II. Process
(A) Analyzing the page code of http://www.xicidaili.com/nn/1
1. Page analysis
The first page to crawl is shown below; the IP address, port, and type are what we want to extract.
Moving to the second page and observing the URL, we can see that it changes from http://www.xicidaili.com/nn/1 to http://www.xicidaili.com/nn/2, and the same holds for the following pages. We can conclude that the links have the form http://www.xicidaili.com/nn/ followed by a number indicating which page it is.
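Given this pattern, the list of page URLs can be generated with a one-line loop (a minimal sketch; crawling 10 pages as in the rest of the article):

```python
# Build the URLs for the first 10 pages of the proxy list;
# the numeric suffix is the page number observed above.
base_url = "http://www.xicidaili.com/nn/"
urls = [base_url + str(page) for page in range(1, 11)]
print(urls[0])   # http://www.xicidaili.com/nn/1
print(urls[-1])  # http://www.xicidaili.com/nn/10
```

The full code later in the article builds the same list with an explicit for loop.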
Next, open the browser's developer tools, switch to the Network tab, and locate the response containing the page content:
We can see that the proxy information we want to crawl lives inside <tr></tr> tags. Analyzing further, these rows have a class of either "odd" or "". Within each tr tag, the IP address is in the second td tag, the port in the third td tag, and the type in the sixth td tag.
We can now try crawling the page and parsing it with each of three common parsing libraries: lxml, Beautiful Soup, and pyquery.
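Before hitting the live site, the row and column rules described above can be sanity-checked offline with lxml against a small hand-written fragment that mimics the table structure (the fragment and its values are made up for illustration):

```python
from lxml import etree

# A minimal stand-in for the real #ip_list table: data rows have
# class="odd" or class="", with the IP in td[2], port in td[3], type in td[6].
sample = """
<table id="ip_list">
  <tr><th>国家</th><th>IP地址</th><th>端口</th><th>服务器地址</th>
      <th>是否匿名</th><th>类型</th></tr>
  <tr class="odd"><td></td><td>1.2.3.4</td><td>8080</td><td>X</td>
      <td>高匿</td><td>HTTP</td></tr>
  <tr class=""><td></td><td>5.6.7.8</td><td>3128</td><td>Y</td>
      <td>高匿</td><td>HTTPS</td></tr>
</table>
"""
html = etree.HTML(sample)
ips = html.xpath('//tr[@class="odd" or @class=""]/td[2]/text()')
ports = html.xpath('//tr[@class="odd" or @class=""]/td[3]/text()')
types = html.xpath('//tr[@class="odd" or @class=""]/td[6]/text()')
print(ips, ports, types)  # ['1.2.3.4', '5.6.7.8'] ['8080', '3128'] ['HTTP', 'HTTPS']
```

Note that the header row has no class attribute at all, so the `@class="odd" or @class=""` predicate skips it, which is exactly the behavior relied on below.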
2. Crawling the first page
Use the requests library (later replaced by the asynchronous coroutine library aiohttp, because proxies had to be used and tested; see Problem (A): IP address banned), and first try a direct request:
import requests
response = requests.get("http://www.xicidaili.com/nn/1")
print(response.text)
The results are as follows:
<html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body bgcolor="white">
<center><h1>503 Service Temporarily Unavailable</h1></center>
<hr><center>nginx/1.1.19</center>
</body>
</html>
A 503 status code is returned, meaning the service is temporarily unavailable, so something must be done about it; try adding a request header:
import requests
header = {'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"}
response = requests.get("http://www.xicidaili.com/nn/1", headers = header)
print(response.text)
Output:
<!DOCTYPE html>
<html>
<head>
<title>国内高匿免费HTTP代理IP__第1页国内高匿</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<meta name="Description" content="国内高匿免费HTTP代理" />
<meta name="Keywords" content="国内高匿,免费高匿代理,免费匿名代理,隐藏IP" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=0">
<meta name="applicable-device"content="pc,mobile">
......
This retrieves the page content normally. Next, we parse the page with each of the different parsing libraries.
(B) Crawling the information with different parsing libraries
1. Parsing with lxml
import requests

def get_page():
    try:
        header = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"}
        response = requests.get("http://www.xicidaili.com/nn/1", headers=header)
        get_detail(response.text)
    except Exception as e:
        print("发生错误: ", e)

# Crawl with lxml
from lxml import etree

def get_detail(html):
    html = etree.HTML(html)
    # Extract the IP address information
    print(html.xpath('//tr[@class="odd" or @class=""]/td[2]/text()'))

if __name__ == "__main__":
    get_page()
First, try to get all the IP addresses on the first page, using the XPath rule '//tr[@class="odd" or @class=""]/td[2]/text()'. The extracted results are as follows:
['121.40.66.129', '117.88.177.132', '117.88.176.203', '218.21.230.156', '121.31.101.41', '60.205.188.24', '221.206.100.133', '27.154.34.146', '58.254.220.116', '39.91.8.31', '221.218.102.146', '223.10.21.0', '58.56.149.198', '219.132.205.105', '221.237.37.97', '183.163.24.15', '171.80.196.14', '118.114.96.251', '114.239.91.166', '111.222.141.127', '121.237.148.133', '123.168.67.126', '118.181.226.166', '121.237.148.190', '124.200.36.118', '58.58.213.55', '49.235.253.240', '183.147.11.34', '121.40.162.239', '121.237.148.139', '121.237.148.118', '117.88.5.174', '117.88.5.234', '117.87.180.144', '119.254.94.93', '60.2.44.182', '175.155.239.23', '121.237.148.156', '118.78.196.186', '123.118.108.201', '117.88.4.71', '113.12.202.50', '117.88.177.34', '117.88.4.35', '222.128.9.235', '121.237.148.131', '121.237.149.243', '121.237.148.8', '182.61.179.157', '175.148.68.133']
The results contain no errors; the port and type can be obtained the same way:
from lxml import etree

def get_detail(html):
    html = etree.HTML(html)
    # Extract the IP address information
    print(html.xpath('//tr[@class="odd" or @class=""]/td[2]/text()')[:10])
    # Extract the port information
    print(html.xpath('//tr[@class="odd" or @class=""]/td[3]/text()')[:10])
    # Extract the type information
    print(html.xpath('//tr[@class="odd" or @class=""]/td[6]/text()')[:10])
    # Count how many records one page contains
    print(len(html.xpath('//tr[@class="odd" or @class=""]/td[6]/text()')))
The output shows the first ten entries of each field, with 100 records in total:
['121.237.149.117', '121.237.148.87', '59.44.78.30', '124.93.201.59', '1.83.117.56', '117.88.176.132', '121.40.66.129', '222.95.144.201', '117.88.177.132', '121.237.149.132']
['3000', '3000', '42335', '59618', '8118', '3000', '808', '3000', '3000', '3000']
['HTTP', 'HTTP', 'HTTP', 'HTTPS', 'HTTP', 'HTTP', 'HTTP', 'HTTP', 'HTTP', 'HTTP']
100
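The three parallel lists above can then be zipped into the dictionary format used for storage later in the article (a small sketch reusing two of the values shown above):

```python
# Combine the parallel ip/port/type lists into {'proxies': ..., 'types': ...} records
ips = ['121.237.149.117', '59.44.78.30']
ports = ['3000', '42335']
types = ['HTTP', 'HTTP']
records = [{'proxies': ip + ':' + port, 'types': t}
           for ip, port, t in zip(ips, ports, types)]
print(records)
```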
2. Parsing with Beautiful Soup
The page's table structure is as follows:
<table id="ip_list">
<tr>
<th class="country">国家</th>
<th>IP地址</th>
<th>端口</th>
<th>服务器地址</th>
<th class="country">是否匿名</th>
<th>类型</th>
<th class="country">速度</th>
<th class="country">连接时间</th>
<th width="8%">存活时间</th>
<th width="20%">验证时间</th>
</tr>
<tr class="odd">
<td class="country"><img src="//fs.xicidaili.com/images/flag/cn.png" alt="Cn" /></td>
<td>222.128.9.235</td>
<td>59593</td>
<td>
<a href="/2018-09-26/beijing">北京</a>
</td>
<td class="country">高匿</td>
<td>HTTPS</td>
<td class="country">
<div title="0.032秒" class="bar">
<div class="bar_inner fast" style="width:87%">
</div>
</div>
</td>
<td class="country">
<div title="0.006秒" class="bar">
<div class="bar_inner fast" style="width:97%">
</div>
</div>
</td>
<td>533天</td>
<td>20-03-13 15:21</td>
</tr>
...
First, select all the tr tags under the table:
from bs4 import BeautifulSoup

def get_detail(html):
    soup = BeautifulSoup(html, 'lxml')
    c1 = soup.select('#ip_list tr')
    print(c1[1])
The results are as follows:
<tr class="odd">
<td class="country"><img alt="Cn" src="//fs.xicidaili.com/images/flag/cn.png"/></td>
<td>222.128.9.235</td>
<td>59593</td>
<td>
<a href="/2018-09-26/beijing">北京</a>
</td>
<td class="country">高匿</td>
<td>HTTPS</td>
<td class="country">
<div class="bar" title="0.032秒">
<div class="bar_inner fast" style="width:87%">
</div>
</div>
</td>
<td class="country">
<div class="bar" title="0.006秒">
<div class="bar_inner fast" style="width:97%">
</div>
</div>
</td>
<td>533天</td>
<td>20-03-13 15:21</td>
</tr>
The next step is, for each tr tag, to select the second (IP), third (port), and sixth (type) td tags:
from bs4 import BeautifulSoup

def get_detail(html):
    soup = BeautifulSoup(html, 'lxml')
    c1 = soup.select('#ip_list tr')
    ls = []
    for index, tr in enumerate(c1):
        if index != 0:
            td = tr.select('td')
            ls.append({'proxies': td[1].string + ":" + td[2].string,
                       'types': td[5].string})
    print(ls)
    print(len(ls))
The results are as follows:
[{'proxies': '222.128.9.235:59593', 'types': 'HTTPS'}, {'proxies': '115.219.105.60:8010', 'types': 'HTTP'}, {'proxies': '117.88.177.204:3000', 'types': 'HTTP'}, {'proxies': '222.95.144.235:3000', 'types': 'HTTP'}, {'proxies': '59.42.88.110:8118', 'types': 'HTTPS'}, {'proxies': '118.181.226.166:44640', 'types': 'HTTP'}, {'proxies': '121.237.149.124:3000', 'types': 'HTTP'}, {'proxies': '218.86.200.26:8118', 'types': 'HTTPS'}, {'proxies': '106.6.138.18:8118', 'types': 'HTTP'}......]
100
100 records on the page; the result is correct.
3. Parsing with pyquery
The pyquery approach is similar to Beautiful Soup: first delete the first row of the table (the header), then select the table's tr tags:
from pyquery import PyQuery as pq

def get_detail(html):
    doc = pq(html)
    doc('tr:first-child').remove()  # remove the first (header) row
    items = doc('#ip_list tr')
    print(items)
From the output, the format of each item in items can be seen:
...
<tr class="">
<td class="country"><img src="//fs.xicidaili.com/images/flag/cn.png" alt="Cn"/></td>
<td>124.205.143.210</td>
<td>34874</td>
<td>
<a href="/2018-10-05/beijing">北京</a>
</td>
<td class="country">高匿</td>
<td>HTTPS</td>
<td class="country">
<div title="0.024秒" class="bar">
<div class="bar_inner fast" style="width:93%">
</div>
</div>
</td>
<td class="country">
<div title="0.004秒" class="bar">
<div class="bar_inner fast" style="width:99%">
</div>
</div>
</td>
<td>523天</td>
<td>20-03-12 02:20</td>
</tr>
...
Next, take each row out of the items generator and select the second td tag (IP address), the third td tag (port), and the sixth td tag (type), storing them as dictionaries in a list.
from pyquery import PyQuery as pq

def get_detail(html):
    doc = pq(html)
    doc('tr:first-child').remove()  # remove the first (header) row
    items = doc('#ip_list tr')
    ls = []
    for i in items.items():
        tmp1 = i('td:nth-child(2)')  # select the IP address
        tmp2 = i('td:nth-child(3)')  # select the port
        tmp3 = i('td:nth-child(6)')  # select the type
        ls.append({'proxies': tmp1.text() + ":" + tmp2.text(),
                   'types': tmp3.text()})
    print(ls)
    print(len(ls))
Output:
[{'proxies': '222.128.9.235:59593', 'types': 'HTTPS'}, {'proxies': '115.219.105.60:8010', 'types': 'HTTP'}, {'proxies': '117.88.177.204:3000', 'types': 'HTTP'}, {'proxies': '222.95.144.235:3000', 'types': 'HTTP'}, {'proxies': '59.42.88.110:8118', 'types': 'HTTPS'}, {'proxies': '118.181.226.166:44640', 'types': 'HTTP'}, {'proxies': '121.237.149.124:3000', 'types': 'HTTP'}, {'proxies': '218.86.200.26:8118', 'types': 'HTTPS'}......
100
100 records per page; the result is correct.
(C) Choosing a site (Baidu) to test the fetched proxies
Many of the crawled free proxies are unusable or unstable and cannot be stored directly, so we need a site against which to test whether requests made through each crawled proxy succeed. I chose http://www.baidu.com as the test target: only proxies whose requests succeed are added to the database, and a proxy is discarded once its requests fail more than three times.
Testing a proxy this way usually takes ten seconds or even longer, so testing with sequential, queued requests is clearly unreasonable; the asynchronous request library aiohttp is needed. For an introduction to asynchronous coroutines, see the article on using Python asynchronous coroutines; for aiohttp, refer to the aiohttp Chinese documentation.
The two key keywords are await and async. Simply put, a spot A in the code that is likely to wait is prefixed with await; when the thread reaches it, rather than idly waiting it goes off and runs another task B, and as soon as the awaited object responds it comes straight back and continues with what follows A, temporarily shelving task B. However, the object after await must be a coroutine object, a generator that returns a coroutine object, or an iterator returned by an object implementing the __await__ method (which is why you cannot simply put await in front of a requests call). Adding the async modifier to a function makes it return a coroutine object, so it can then be awaited freely inside other async code. Of course, if the place you put await is not one where the thread would otherwise block, such as waiting for a request's response or for data to upload or download, the await has no effect, though it does no harm either.
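The behavior described above can be seen in a tiny self-contained example, using asyncio.sleep as a stand-in for a slow network wait (a sketch; the one-second delays are arbitrary):

```python
import asyncio
import time

async def task(name, delay):
    # While this coroutine awaits, the event loop runs the other tasks.
    await asyncio.sleep(delay)
    return name

async def main():
    # Three tasks that each "wait" one second; run concurrently they finish
    # in about one second total instead of three.
    return await asyncio.gather(task("A", 1), task("B", 1), task("C", 1))

start = time.time()
results = asyncio.run(main())
elapsed = time.time() - start
print(results)  # ['A', 'B', 'C']
print(round(elapsed))  # about 1, not 3
```

The full code later in the article achieves the same scheduling with asyncio.ensure_future plus loop.run_until_complete.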
The proxy-testing function is as follows:
# Test a proxy
async def test_proxy(self, dic):
    ## Build the proxy and test URL according to the type
    if dic["types"] == "HTTP":
        test_url = "http://www.baidu.com/"
        prop = "http://" + dic["proxies"]
    else:
        test_url = "https://www.baidu.com/"
        prop = "https://" + dic["proxies"]
    ua = UserAgent()
    header = {'User-Agent': ua.random}
    # Asynchronous coroutine request
    async with aiohttp.ClientSession() as session:
        while True:
            try:
                async with session.get(test_url, headers=header, proxy=prop, timeout=15, verify_ssl=False) as resp:
                    if resp.status == 200:
                        self.success_test_count += 1
                        print(prop, "\033[5;36;40m===========>测试成功,写入数据库!=========%d次\033[;;m" % self.success_test_count)
                        await self.insert_to_mongo(dic)  ## call the function that writes to MongoDB
                        return
            except Exception as e:
                print(prop, "==测试失败,放弃==", e)
                break
(D) Choosing a database for storage
Considering that the proxy pool will be maintained further, I chose MongoDB for storage; it also makes it easy to avoid inserting duplicate records. The database storage function is as follows:
# Write to the MongoDB database
async def insert_to_mongo(self, dic):
    db = self.client.Myproxies
    collection = db.proxies
    collection.update_one(dic, {'$set': dic}, upsert=True)  # upsert=True avoids duplicate inserts
    print("\033[5;32;40m插入记录:" + json.dumps(dic), "\033[;;m")
(E) Full code
1. Version that uses proxies during the crawling stage
Finally, the complete code is as follows. This is the version that makes the crawling-stage requests through proxies (because my machine's IP had been banned, I had no choice; this makes the process slow, and the version that crawls without proxies, using them only for testing, is posted afterwards). Of the three parsing libraries introduced earlier, lxml is chosen for parsing:
import json
import time
import random
from fake_useragent import UserAgent
import asyncio
import aiohttp
# Avoid a RuntimeError (see Problems section)
import nest_asyncio
nest_asyncio.apply()
from lxml import etree
import pymongo

class Get_prox:
    def __init__(self):
        # Initialize and connect to MongoDB
        self.client = pymongo.MongoClient('mongodb://localhost:27017/')
        self.success_get_count = 0
        self.success_test_count = 0

    # Fetch a page (using proxies)
    async def get_page(self, session, url):
        ## fake_useragent generates random request headers
        ua = UserAgent()
        header = {'User-Agent': ua.random}
        # Load the proxy pool from a local file
        proxies_pool = self.get_proxies()
        while True:
            try:
                # Because I carelessly got my IP banned at the start, I had to use a batch
                # of proxies grabbed from another site (as described in the Problems section),
                # 5999 proxies in total; one is chosen at random each time
                p = 'http://' + random.choice(proxies_pool)
                async with session.get(url, headers=header, proxy=p, timeout=10) as response:
                    await asyncio.sleep(2)
                    if response.status == 200:
                        self.success_get_count += 1
                        print("\033[5;36;40m----------------------请求成功-------------------%d次\033[;;m" % self.success_get_count)
                        return await response.text()
                    else:
                        print("\033[5;31;m", response.status, "\033[;;m")
                        continue
            except Exception as e:
                print("请求失败orz", e)

    # Task
    async def get(self, url):
        async with aiohttp.ClientSession() as session:
            html = await self.get_page(session, url)
            await self.get_detail(html)

    # Test a proxy
    async def test_proxy(self, dic):
        ## Build the proxy and test URL according to the type
        if dic["types"] == "HTTP":
            test_url = "http://www.baidu.com/"
            prop = "http://" + dic["proxies"]
        else:
            test_url = "https://www.baidu.com/"
            prop = "https://" + dic["proxies"]
        ua = UserAgent()
        header = {'User-Agent': ua.random}
        # Asynchronous coroutine request
        async with aiohttp.ClientSession() as session:
            while True:
                try:
                    async with session.get(test_url, headers=header, proxy=prop, timeout=15, verify_ssl=False) as resp:
                        if resp.status == 200:
                            self.success_test_count += 1
                            print(prop, "\033[5;36;40m===========>测试成功,写入数据库!=========%d次\033[;;m" % self.success_test_count)
                            await self.insert_to_mongo(dic)  ## call the function that writes to MongoDB
                            return
                except Exception as e:
                    print(prop, "==测试失败,放弃==", e)
                    break

    # Load the proxy pool
    def get_proxies(self):
        with open("proxies.txt", "r") as f:
            ls = json.loads(f.read())
        return ls

    # Parse with lxml
    async def get_detail(self, html):
        html = etree.HTML(html)
        dic = {}
        ip = html.xpath('//tr[@class="odd" or @class=""]/td[2]/text()')
        port = html.xpath('//tr[@class="odd" or @class=""]/td[3]/text()')
        types = html.xpath('//tr[@class="odd" or @class=""]/td[6]/text()')
        for i in range(len(ip)):
            dic['proxies'] = ip[i] + ":" + port[i]
            dic['types'] = types[i]
            await self.test_proxy(dic)

    # Write to the MongoDB database
    async def insert_to_mongo(self, dic):
        db = self.client.Myproxies
        collection = db.proxies
        collection.update_one(dic, {'$set': dic}, upsert=True)  # upsert=True avoids duplicate inserts
        print("\033[5;32;40m插入记录:" + json.dumps(dic), "\033[;;m")

# Main thread
if __name__ == "__main__":
    urls = []
    start = time.time()
    # Crawl the first 10 pages
    for i in range(1, 11):
        urls.append("http://www.xicidaili.com/nn/" + str(i))
    c = Get_prox()
    # Create 10 future task objects
    tasks = [asyncio.ensure_future(c.get(url)) for url in urls]
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    end = time.time()
    total = (end - start) / 60.0
    print("完成,总耗时:", total, "分钟!")
The run prints a lot of log output; part of it is shown below:
In both the crawling and the testing stages, the success rate of requests through the proxies is very low; a complete run takes quite a while, with a reported total time of 47 minutes.
A brief look at the log shows the last record being successfully inserted...
Looking in the database (after I had run the program several times), only 50 records had been inserted:
2. Version that does not use proxies during the crawling stage
Next is the version that does not use proxies for the crawling stage, i.e. it crawls with requests and then tests with aiohttp, eliminating the long wait of filtering through proxies in the first stage.
import json
import time
import requests
from fake_useragent import UserAgent
import asyncio
import aiohttp
# Avoid a RuntimeError (see Problems section)
import nest_asyncio
nest_asyncio.apply()
from lxml import etree
import pymongo

class Get_prox:
    def __init__(self):
        # Initialize and connect to MongoDB
        self.client = pymongo.MongoClient('mongodb://localhost:27017/')
        self.success_get_count = 0
        self.success_test_count = 0

    # Fetch a page (without a proxy)
    def get_page(self, url):
        ## fake_useragent generates random request headers
        ua = UserAgent()
        header = {'User-Agent': ua.random}
        while True:
            try:
                response = requests.get(url, headers=header, timeout=10)
                time.sleep(1.5)
                if response.status_code == 200:
                    self.success_get_count += 1
                    print("\033[5;36;40m----------------------请求成功-------------------%d次\033[;;m" % self.success_get_count)
                    return response.text
                else:
                    print("\033[5;31;m", response.status_code, "\033[;;m")
                    continue
            except Exception as e:
                print("请求失败orz", e)

    # Task
    def get(self, urls):
        htmls = []
        # First fetch all the pages into a list
        for url in urls:
            htmls.append(self.get_page(url))
        # Test the proxies asynchronously
        tasks = [asyncio.ensure_future(self.get_detail(html)) for html in htmls]
        loop = asyncio.get_event_loop()
        loop.run_until_complete(asyncio.wait(tasks))

    # Test a proxy
    async def test_proxy(self, dic):
        ## Build the proxy and test URL according to the type
        if dic["types"] == "HTTP":
            test_url = "http://www.baidu.com/"
            prop = "http://" + dic["proxies"]
        else:
            test_url = "https://www.baidu.com/"
            prop = "https://" + dic["proxies"]
        ua = UserAgent()
        header = {'User-Agent': ua.random}
        # Asynchronous coroutine request
        async with aiohttp.ClientSession() as session:
            while True:
                try:
                    async with session.get(test_url, headers=header, proxy=prop, timeout=15, verify_ssl=False) as resp:
                        if resp.status == 200:
                            self.success_test_count += 1
                            print(prop, "\033[5;36;40m===========>测试成功,写入数据库!=========%d次\033[;;m" % self.success_test_count)
                            await self.insert_to_mongo(dic)  ## call the function that writes to MongoDB
                            return
                except Exception as e:
                    print(prop, "==测试失败,放弃==", e)
                    break

    # Parse with lxml
    async def get_detail(self, html):
        html = etree.HTML(html)
        dic = {}
        ip = html.xpath('//tr[@class="odd" or @class=""]/td[2]/text()')
        port = html.xpath('//tr[@class="odd" or @class=""]/td[3]/text()')
        types = html.xpath('//tr[@class="odd" or @class=""]/td[6]/text()')
        for i in range(len(ip)):
            dic['proxies'] = ip[i] + ":" + port[i]
            dic['types'] = types[i]
            await self.test_proxy(dic)

    # Write to the MongoDB database
    async def insert_to_mongo(self, dic):
        db = self.client.Myproxies
        collection = db.proxies
        collection.update_one(dic, {'$set': dic}, upsert=True)  # upsert=True avoids duplicate inserts
        print("\033[5;32;40m插入记录:" + json.dumps(dic) + "\033[;;m")

# Main thread
if __name__ == "__main__":
    urls = []
    start = time.time()
    # Crawl the first 10 pages
    for i in range(1, 11):
        urls.append("http://www.xicidaili.com/nn/" + str(i))
    c = Get_prox()
    c.get(urls)
    end = time.time()
    total = (end - start) / 60.0
    print("完成,总耗时:", total, "分钟!")
Screenshots of the results, as measured by a friend, are as follows:
The 10 crawling-stage requests went very smoothly.
The total time was 19 minutes; clearly, not having to filter through proxies during the crawling stage really saves a lot of time!
IV. Problems and Solutions
(A) IP address banned
Since at the beginning I was using lxml to explore the parsing rules, setting a sleep time was inconvenient, and later, through negligence, I forgot to set one when crawling the pages. After crawling several times, the information in the log output became the following:
{"proxies": "121.237.148.195:3000", "types": "HTTP"}
{"proxies": "121.234.31.44:8118", "types": "HTTPS"}
{"proxies": "117.88.4.63:3000", "types": "HTTP"}
{"proxies": "222.95.144.58:3000", "types": "HTTP"}
发生错误: 'NoneType' object has no attribute 'xpath'
发生错误: 'NoneType' object has no attribute 'xpath'
发生错误: 'NoneType' object has no attribute 'xpath'
发生错误: 'NoneType' object has no attribute 'xpath'
发生错误: 'NoneType' object has no attribute 'xpath'
发生错误: 'NoneType' object has no attribute 'xpath'
......
After terminating the program, printing the response status codes gave the following results:
503
503
503
503
503
...
I could no longer reach the site through the browser either, from which it can be concluded that, because of too many crawls, my IP had been banned by the site.
- Solution
At first I hand-picked a few proxy IPs from other free proxy sites, but found that a very large proportion of free proxies do not work, and setting up one of the existing proxy-pool projects from the Internet, with its environment and configuration dependencies, is time-consuming. So I went straight to the 66ip free proxy site and used its free IP extraction feature to extract 6000 proxy IPs:
Clicking extract leads directly to a page containing the 6000 proxies, so a simple program can be written to crawl this page and save the 6000 (actually 5999 captured) proxies to a local file:
import re
import json
import requests

response1 = requests.get("http://www.66ip.cn/mo.php?sxb=&tqsl=6000&port=&export=&ktip=&sxa=&submit=%CC%E1++%C8%A1&textarea=")
html = response1.text
print(response1.status_code == 200)
pattern = re.compile("br />(.*?)<", re.S)
items = re.findall(pattern, html)
for i in range(len(items)):
    items[i] = items[i].strip()
print(len(items))
with open("proxies.txt", "w") as f:
    f.write(json.dumps(items))
Then this file is read as the crawler's proxy pool:
# Load the proxy pool
def get_proxies(self):
    with open("proxies.txt", "r") as f:
        ls = json.loads(f.read())
    return ls
Each request then randomly selects a proxy from the pool:
def get_page(page):
    url = []
    ua = UserAgent()
    with open("proxies.txt", "r") as f:
        ls = json.loads(f.read())
    for i in range(1, page + 1):
        url.append("http://www.xicidaili.com/nn/" + str(i))
    count = 1
    errcount = 1
    for u in url:
        while True:
            try:
                header = {'User-Agent': ua.random}
                handler = {'http': 'http://' + random.choice(ls)}
                response = requests.get(u, headers=header, proxies=handler, timeout=10)
                time.sleep(1)
                get_detail(response.text)
                if response.status_code == 200:
                    print("选取ip:", handler, "请求成功---------------------------第%d次" % count)
                    count += 1
                else:
                    continue
                break
            except:
                print("选取ip:", handler, ", 第%d请求发生错误" % errcount)
                errcount += 1
But there is a problem: with this kind of scheduling the thread can only handle one task at a time, and since many of the proxy IPs are hard to use, each attempt takes several seconds, and in most cases the request errors out.
To solve this problem, we cannot schedule the page crawls with a single-threaded, step-by-step approach, so I chose the asynchronous request library aiohttp.
Referring to the introduction to Python asynchronous coroutines and the aiohttp Chinese documentation, I learned to create coroutine objects for 10 tasks (crawling 10 pages) and schedule them asynchronously: whenever a task's thread encounters a request, it need not wait for that task and can schedule the next one, and once all 10 requests have succeeded we proceed to the next function call, reducing the total time roughly tenfold. The method is as follows (not all functions are listed):
# Fetch a page (using proxies)
async def get_page(self, session, url):
    ## fake_useragent generates random request headers
    ua = UserAgent()
    header = {'User-Agent': ua.random}
    # Load the proxy pool from a local file
    proxies_pool = self.get_proxies()
    while True:
        try:
            # Because I carelessly got my IP banned at the start, I had to use a batch
            # of proxies grabbed from another site (as described in the problem above),
            # 5999 proxies in total; one is chosen at random each time
            p = 'http://' + random.choice(proxies_pool)
            async with session.get(url, headers=header, proxy=p, timeout=10) as response:
                await asyncio.sleep(2)
                if response.status == 200:
                    self.success_get_count += 1
                    print("\033[5;36;40m----------------------请求成功-------------------%d次\033[;;m" % self.success_get_count)
                    return await response.text()
                else:
                    print("\033[5;31;m", response.status, "\033[;;m")
                    continue
        except Exception as e:
            print("请求失败orz", e)

# Task
async def get(self, url):
    async with aiohttp.ClientSession() as session:
        html = await self.get_page(session, url)
        await self.get_detail(html)

# Main thread
if __name__ == "__main__":
    urls = []
    start = time.time()
    # Crawl the first 10 pages
    for i in range(1, 11):
        urls.append("http://www.xicidaili.com/nn/" + str(i))
    c = Get_prox()
    # Create 10 future task objects
    tasks = [asyncio.ensure_future(c.get(url)) for url in urls]
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    end = time.time()
    total = (end - start) / 60.0
    print("完成,总耗时:", total, "分钟!")
Part of the log printed during the crawling process is shown below; as can be seen, the probability of a proxy request succeeding is very low, and the next step is simply to wait:
(B) RuntimeError during asynchronous execution
When the program began running coroutines asynchronously, the console printed the following error:
RuntimeError: asyncio.run() cannot be called from a running event loop
Searching the Internet for solutions, I added the following at the beginning of the program:
import nest_asyncio
nest_asyncio.apply()
After that the error no longer occurred. The error is raised when asyncio.run() is called while an event loop is already running (as happens, for example, in Jupyter); nest_asyncio patches asyncio to allow the event loop to be re-entered.
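For reference, the error comes from plain asyncio: asyncio.run() refuses to be called while an event loop is already running, which is exactly the situation nest_asyncio patches around (a minimal reproduction, without nest_asyncio):

```python
import asyncio

async def inner():
    return 42

async def outer():
    # A loop is already running here, so a nested asyncio.run() raises RuntimeError.
    coro = inner()
    try:
        asyncio.run(coro)
    except RuntimeError as e:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(e)

msg = asyncio.run(outer())
print(msg)  # asyncio.run() cannot be called from a running event loop
```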
V. Areas for further improvement
- If the crawling stage's IP had not been banned, one could request the pages directly with requests, taking care to set a sleep time. In fact, the crawling stage could fetch proxies from several different proxy sites at the same time, which would also let the asynchronous request mechanism come into play: create multiple tasks, each using requests for a different site, and add them all to the asynchronous event loop as coroutine tasks.
- My approach only puts the proxies into a local database, which is static; many proxy-pool projects on the Internet are maintained dynamically and provide web and API interfaces, though they are more complex to implement. This is worth studying further in follow-up work.