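Problem description:
When crawling a website quickly, requests often fail with a "Traceback (most recent call last):" error, which here means the connection failed. The main cause is that the network connection becomes unstable when requests are sent in rapid succession, so I wrote a function that retries the connection several times.
Error output: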
Traceback (most recent call last):
  File "E:/pycharm/PycharmProjects/爬虫/BG5.py", line 118, in <module>
    main(j)
  File "E:/pycharm/PycharmProjects/爬虫/BG5.py", line 84, in main
    response1 = getHTMLText(data[j][0])
  File "E:/pycharm/PycharmProjects/爬虫/BG5.py", line 54, in getHTMLText
    response = requests.get(url, headers=kv, timeout=60)
  File "E:\pycharm\PycharmProjects\venv\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "E:\pycharm\PycharmProjects\venv\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "E:\pycharm\PycharmProjects\venv\lib\site-packages\requests\sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "E:\pycharm\PycharmProjects\venv\lib\site-packages\requests\sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "E:\pycharm\PycharmProjects\venv\lib\site-packages\requests\adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.wzfg.com', port=80): Max retries exceeded with url: /realweb/stat/ProjectListHouseAll.jsp?status=&projectid=9001708&permitNo=%E7%91%9E%E5%AE%89%E5%B8%82%E5%94%AE%E8%AE%B8%E5%AD%97(2017)%E7%AC%AC010%E5%8F%B7 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000000000D42E208>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))
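Solution:
import requests

def getHTMLText(url):
    """Fetch url and return its text, retrying when the request fails."""
    maxTryNum = 20
    kv = {"user-agent": "Mozilla/5.0"}
    for tries in range(maxTryNum):
        try:
            response = requests.get(url, headers=kv, timeout=60)
            return response.text
        except requests.exceptions.RequestException:
            # Connection error or timeout: move on to the next attempt.
            continue
    # Every attempt failed; report it and return None so the caller can check.
    print("Has tried %d times to access url %s, all failed!" % (maxTryNum, url))
    return None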
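
The retry loop above tries again immediately after a failure. If the failures really come from requesting too fast, pausing between attempts may help. Below is a minimal sketch of that idea, not part of the original solution: the helper name retry_with_backoff, its parameters, and the stand-in flaky_fetch function are all hypothetical and chosen only for illustration. It uses only the standard library, so it runs without network access.

```python
import time


def retry_with_backoff(fetch, url, max_tries=5, base_delay=1.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call fetch(url), pausing longer after each failed attempt.

    Returns the result of fetch(url), or None if every attempt failed.
    All names here are illustrative assumptions, not an existing API.
    """
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except retry_on as exc:
            if attempt == max_tries - 1:
                print("Tried %d times to access %s, all failed: %r"
                      % (max_tries, url, exc))
                return None
            # Wait base_delay * 1, 2, 4, ... seconds before the next attempt.
            sleep(base_delay * 2 ** attempt)


# Demonstration with a stand-in fetcher that fails twice, then succeeds.
calls = []
delays = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated connection failure")
    return "<html>ok</html>"


# Record the requested delays instead of actually sleeping.
result = retry_with_backoff(flaky_fetch, "http://example.com/",
                            sleep=delays.append)
print(result, delays)
```

To use it with requests, one option is to pass something like lambda u: requests.get(u, headers=kv, timeout=60).text as fetch. The exceptions raised by requests derive from IOError, which is an alias of OSError in Python 3, so the default retry_on should catch them.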