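When crawling Weibo, pay close attention to your request rate: requests sent too quickly from one IP will get that IP temporarily blocked.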

Basic test script (Python):

import time
import requests

# Weibo mobile search endpoint used as the probe URL
URL = "https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D61%26q%3D%E7%96%AB%E6%83%85%26t%3D0&page_type=searchall&page=2"
TEST_DURATION = 5 * 60  # test each rate for 5 minutes

def test_ip_freq(freq):
    """Request URL at a nominal rate of freq requests/s until blocked or time runs out."""
    if freq == 0:
        return
    # Sleep between requests; the request time itself is extra, so the real rate is lower
    delay = 1 / freq
    t0 = time.time()
    requests_num = 0
    status = "success"
    while True:
        r = requests.get(URL)
        if r.status_code != 200:
            # Any non-200 response is treated as the IP being blocked
            status = 'fail'
            break
        requests_num += 1
        if time.time() - t0 > TEST_DURATION:
            break
        time.sleep(delay)
    elapsed = time.time() - t0
    print("Nominal rate: {0}/s, status: {1}, total requests: {2}, elapsed: {3}s, actual rate: {4}".format(freq, status, requests_num, elapsed, requests_num / elapsed))
    return status

# Step the nominal rate up until a run gets blocked
for i in [0.3, 0.35, 0.4, 0.45, 0.5]:
    status = test_ip_freq(i)
    if status == 'fail':
        break

# Measure how long the IP stays blocked
t0 = time.time()
while True:
    r = requests.get(URL)
    if r.status_code == 200:
        break
    time.sleep(10)
print("IP was blocked for {0}s".format(time.time() - t0))
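
The script above sleeps a fixed 1/freq after every request, so the time each request takes is added on top and the real rate ends up below the nominal one. If you want the real rate to track the target, one option is to pace by request start time instead. The following is only a minimal sketch of that idea: the `Pacer` class and its `clock`/`sleep` parameters are hypothetical names introduced here for illustration, not part of the original script.

```python
import time


class Pacer:
    """Keeps the *starts* of successive calls at least 1/rate seconds apart."""

    def __init__(self, rate_per_sec, clock=time.monotonic, sleep=time.sleep):
        if rate_per_sec <= 0:
            raise ValueError("rate_per_sec must be positive")
        self.interval = 1.0 / rate_per_sec
        self._clock = clock      # injectable so the pacing logic can be tested
        self._sleep = sleep
        self._next_start = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self._clock()
        if self._next_start is not None and now < self._next_start:
            # Only sleep for whatever part of the interval the last request did not use up
            self._sleep(self._next_start - now)
            now = max(self._clock(), self._next_start)
        self._next_start = now + self.interval
```

To use it, create one `Pacer(0.3)` and call `pacer.wait()` immediately before each `requests.get(...)`. A request that takes longer than the interval is simply followed by the next one with no extra sleep.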

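Test results:

Nominal rate: 0.3/s, status: success, total requests: 81, elapsed: 303.2352440357208s, actual rate: 0.2671193457659502
Nominal rate: 0.35/s, status: success, total requests: 91, elapsed: 302.8865134716034s, actual rate: 0.30044256166107425
Nominal rate: 0.4/s, status: fail, total requests: 53, elapsed: 164.40774130821228s, actual rate: 0.3223692484202544
IP was blocked for 183s

In this run, the nominal rates of 0.3/s and 0.35/s completed the full 5 minutes. The 0.4/s run was blocked after 53 requests (about 164s in, at an actual rate of roughly 0.32/s), and the block lasted about 3 minutes.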
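
The actual rates in the log are lower than the nominal ones because each loop iteration costs the request time plus the fixed sleep. As a rough check, the average time per request can be backed out of the logged numbers. This sketch assumes the loop spends its time only in the HTTP request and in `time.sleep`. The function name `avg_request_seconds` is made up for this example.

```python
# (nominal rate in req/s, successful requests, elapsed seconds, blocked?) from the log above
RUNS = [
    (0.3, 81, 303.2352440357208, False),
    (0.35, 91, 302.8865134716034, False),
    (0.4, 53, 164.40774130821228, True),
]


def avg_request_seconds(freq, ok_requests, elapsed, blocked):
    """Estimate the average time spent per HTTP request, excluding sleeps."""
    delay = 1.0 / freq
    if blocked:
        # Every successful request was followed by a sleep, then one more
        # request came back non-200 and ended the loop.
        sleeps, total_requests = ok_requests, ok_requests + 1
    else:
        # The loop exits on the time check right after the last request,
        # so the final sleep never happens.
        sleeps, total_requests = ok_requests - 1, ok_requests
    return (elapsed - sleeps * delay) / total_requests


for freq, n, elapsed, blocked in RUNS:
    latency = avg_request_seconds(freq, n, elapsed, blocked)
    print("nominal {0}/s: about {1:.2f}s per request".format(freq, latency))
```

Under these assumptions each request took on the order of half a second, which is why a nominal 0.4/s only reached about 0.32/s in practice. When choosing a safe rate, compare against the actual rate, not the nominal one.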

 

Recommended HTTPS proxy:

Zhima Proxy (芝麻代理): http://h.zhimaruanjian.com/
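
For reference, `requests` accepts a `proxies` mapping on each call, so routing the probe through a proxy might look like the sketch below. The host, port and credentials are placeholders, and the helper names `build_proxies` and `fetch_via_proxy` are hypothetical. How a given provider hands out proxy addresses is not covered here, so check the provider's own documentation.

```python
def build_proxies(host, port, user=None, password=None):
    """Return a requests-style proxies mapping for one HTTP(S) proxy endpoint."""
    auth = "{0}:{1}@".format(user, password) if user and password else ""
    proxy_url = "http://{0}{1}:{2}".format(auth, host, port)
    # The same proxy endpoint is used for both plain HTTP and HTTPS traffic
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_proxy(url, proxies, timeout=10):
    """GET a URL through the given proxy and return the HTTP status code."""
    import requests  # third-party; imported lazily so build_proxies works without it
    return requests.get(url, proxies=proxies, timeout=timeout).status_code
```

For example, `fetch_via_proxy(url, build_proxies("127.0.0.1", 8080))` sends the request through a proxy listening locally on port 8080 (a placeholder address).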


Reposted from www.cnblogs.com/xunhanliu/p/13384771.html