[Python Crawler] Improving Performance

Concurrency and asynchronous IO

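When writing a crawler, most of the run time is spent on IO requests. In single-process, single-threaded mode every URL request blocks until the response arrives, so the waits add up and the crawl as a whole becomes slow.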

import requests

def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']

for url in url_list:
    print(url,fetch_async(url))
1-Synchronous execution
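To see how much of the run time is pure waiting, the request can be replaced by a stand-in that only sleeps. The following is a minimal sketch under that assumption, comparing a sequential loop with a thread pool (the approach used in the next listing). The names `fake_fetch`, `run_sequential`, `run_threaded` and the 0.2 s delay are hypothetical and not part of the original code.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.2  # assumed per-request latency in seconds (hypothetical value)


def fake_fetch(url):
    """Stand-in for requests.get: blocks the way a slow network call would."""
    time.sleep(DELAY)
    return url, 200


def run_sequential(urls):
    """Fetch one URL after another; the waits add up."""
    start = time.perf_counter()
    results = [fake_fetch(url) for url in urls]
    return results, time.perf_counter() - start


def run_threaded(urls, workers=5):
    """Fetch through a thread pool; the waits overlap."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input URLs
        results = list(pool.map(fake_fetch, urls))
    return results, time.perf_counter() - start


if __name__ == '__main__':
    urls = ['http://www.github.com', 'http://www.bing.com', 'http://www.baidu.com']
    _, seq_time = run_sequential(urls)
    _, pool_time = run_threaded(urls)
    print("sequential:", round(seq_time, 2), "thread pool:", round(pool_time, 2))
```

With real requests the numbers depend on the network, but the pattern should be similar: the sequential time grows with the number of URLs, while the pooled time stays close to the slowest single request as long as there are enough workers.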
from concurrent.futures import ThreadPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ThreadPoolExecutor(5)
for url in url_list:
    pool.submit(fetch_async, url)

pool.shutdown(wait=True)
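2-Multithreading (thread pool) execution
"""concurrent.futures: thread pool"""
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse
import time
import requests

def task(url):
    response = requests.get(url)
    print(url,response.status_code)
    response.encoding = response.apparent_encoding
    if response.status_code == 200:
        return {"url":url,"text":response.text}

def save_to_html(res,*args,**kwargs):
    # The callback receives a Future object, e.g. <Future at 0x1ed4cf245c0 state=finished returned dict>
    res = res.result()
    if res is None:    # task() returns None for non-200 responses
        return
    # Name the file after the host; split(".")[-2] breaks on URLs with a path or a multi-part domain
    filename = urlparse(res['url']).netloc + ".html"
    with open(filename,'w',encoding='utf-8') as f:
        f.write(res["text"])
    print(filename,"--->written successfully!")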

def parse_html(res,*args,**kwargs):
    pass

if __name__ == '__main__':
    start = time.time()
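    pool = ThreadPoolExecutor()    # thread pool; if max_workers is omitted, the default is CPU count * 5 on Python 3.5-3.7 and min(32, CPU count + 4) from Python 3.8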
    url_list = [
        'http://www.cnblogs.com/',
        'https://huaban.com/favorite/beauty/',
        'http://www.bing.com',
        'http://www.zhihu.com',
        'http://www.sina.com',
        'http://www.baidu.com',
        'http://www.autohome.com.cn',
    ]
    for url in url_list:
        v = pool.submit(task,url)
        v.add_done_callback(save_to_html)
        v.add_done_callback(parse_html)

    pool.shutdown(wait=True)
    print("consume time is:",time.time()-start)
3-Multithreading + callback functions
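Because a callback receives the finished `Future` rather than the task's return value, it also has to cope with tasks that failed or returned nothing. The sketch below illustrates one way to do that using `Future.exception()`. It avoids the network by using a made-up `fake_task`, and the names `on_done`, `run`, `saved` and `failed` are likewise hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

saved = []   # URLs "saved" by the callback (kept in memory for this sketch)
failed = []  # error messages collected by the callback


def fake_task(url):
    """Hypothetical stand-in for task(): one URL fails, one returns None."""
    if 'bad' in url:
        raise ConnectionError("cannot reach " + url)
    if 'missing' in url:
        return None  # mimics a non-200 response
    return {"url": url, "text": "<html></html>"}


def on_done(future):
    """Callback: receives the finished Future, not the task's return value."""
    error = future.exception()  # None if the task finished normally
    if error is not None:
        failed.append(str(error))
        return
    result = future.result()
    if result is None:
        return  # nothing to save
    saved.append(result["url"])


def run(urls):
    with ThreadPoolExecutor(max_workers=3) as pool:
        for url in urls:
            pool.submit(fake_task, url).add_done_callback(on_done)
    # leaving the with-block waits for all submitted tasks to finish


if __name__ == '__main__':
    run(['http://ok.example', 'http://bad.example', 'http://missing.example'])
    print("saved:", saved)
    print("failed:", failed)
```

Calling `future.result()` directly on a failed task would re-raise the task's exception inside the callback, so checking `exception()` first keeps one bad URL from hiding the error.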
from concurrent.futures import ProcessPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']
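if __name__ == '__main__':    # guard needed so child processes can re-import this module safely (e.g. on Windows)
    pool = ProcessPoolExecutor(5)
    for url in url_list:
        pool.submit(fetch_async, url)

    pool.shutdown(wait=True)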
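4-Multiprocessing
"""concurrent.futures: process pool"""
from concurrent.futures import ProcessPoolExecutor
from urllib.parse import urlparse
import time
import requests

def task(url):
    response = requests.get(url)
    print(url,response.status_code)
    response.encoding = response.apparent_encoding
    if response.status_code == 200:
        return {"url":url,"text":response.text}

def save_to_html(res,*args,**kwargs):
    # The callback receives a Future object, e.g. <Future at 0x1ed4cf245c0 state=finished returned dict>
    res = res.result()
    if res is None:    # task() returns None for non-200 responses
        return
    # Name the file after the host; split(".")[-2] breaks on URLs with a path or a multi-part domain
    filename = urlparse(res['url']).netloc + ".html"
    with open(filename,'w',encoding='utf-8') as f:
        f.write(res["text"])
    print(filename,"--->written successfully!")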

def parse_html(res,*args,**kwargs):
    pass

if __name__ == '__main__':
    start = time.time()
    pool = ProcessPoolExecutor()    # process pool; if max_workers is omitted, the default is the number of CPUs
    url_list = [
        'http://www.cnblogs.com/',
        'https://huaban.com/favorite/beauty/',
        'http://www.bing.com',
        'http://www.zhihu.com',
        'http://www.sina.com',
        'http://www.baidu.com',
        'http://www.autohome.com.cn',
    ]
    for url in url_list:
        v = pool.submit(task,url)
        v.add_done_callback(save_to_html)
        v.add_done_callback(parse_html)

    pool.shutdown(wait=True)
    print("consume time is:",time.time()-start)
5-Multiprocessing + callback functions

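All of the approaches above improve request throughput. The drawback of multithreading and multiprocessing is that a thread or process blocked on IO does nothing while it waits, so that worker is wasted. For this reason asynchronous IO is the preferred choice.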
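As a minimal sketch of what the asynchronous approach looks like, the example below runs several simulated requests on a single thread with the standard-library `asyncio` module. It is an illustration only: `fake_fetch` and `crawl` are hypothetical names, and `asyncio.sleep` stands in for the network wait. A real crawler would need an asynchronous HTTP client such as aiohttp here, since `requests.get` blocks the event loop.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.2):
    """Hypothetical stand-in for an asynchronous HTTP request."""
    await asyncio.sleep(delay)  # yields to the event loop while "waiting on IO"
    return url, 200


async def crawl(urls):
    """Start all requests at once and wait for every one of them."""
    # gather() returns the results in the same order as the input coroutines
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


if __name__ == '__main__':
    url_list = ['http://www.github.com', 'http://www.bing.com']
    start = time.perf_counter()
    for url, status in asyncio.run(crawl(url_list)):
        print(url, status)
    print("consume time is:", time.perf_counter() - start)
```

While one coroutine is waiting, the event loop switches to another, so no thread or process sits idle per pending request.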

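Supplementary reading: coroutines + asynchronous IO (also explains, with examples, concurrency vs. parallelism, synchronous vs. asynchronous, and blocking vs. non-blocking)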

Reference: https://blog.csdn.net/weixin_41207499/article/details/80657201

Reference: https://www.cnblogs.com/ssyfj/p/9222342.html

https://www.liaoxuefeng.com/wiki/1016959663602400/1017985577429536


Reposted from www.cnblogs.com/XJT2018/p/11002526.html