A simple comparison of crawling speed: synchronous | multithreaded | coroutine

「This chapter does nothing else; we just briefly compare crawler speeds」

Results first. The numbers below are the best of several runs; the time of day still affects throughput, so treat them as a rough reference.

"""
普通函数执行:总耗时 3.330171585083008 S
线程池执行:总耗时 总耗时 1.6058530807495117 S
多线程执行:总耗时 总耗时 1.8512330055236816 S
协程异步执行:总耗时 总耗时 总耗时 1.091230869293213 S
线程池协程异步:总耗时 0.8936080932617188 S
"""

Plain function

That is, a synchronous crawler.

import os
import requests
import time

List_Url = ['https://img.soutula.com/large/006APoFYly8hbhsbsmciuj30hs0hfaam.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhrnv8vx1j20hs0hj3zo.jpg',
            'https://img.soutula.com/large/006APoFYly8hbhm20mb6rj30hs0hmt9j.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhixn8cvvj205k0563yf.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhbzwzy4ej20hs0f1753.jpg']

def run():
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36 Edg/91.0.864.53',
        'Referer': 'https://www.fabiaoqing.com/biaoqing'
    }
    # Download the images one after another; each request blocks until done
    for Url in List_Url:
        response = requests.get(Url, headers=headers)
        with open("爬fabiaoqing网图/" + Url[-8:-4] + '.jpg', 'wb') as w:
            w.write(response.content)

if __name__ == '__main__':
    os.makedirs("爬fabiaoqing网图", exist_ok=True)  # output directory must exist
    s = time.time()
    run()
    print("Total time {} s".format(time.time() - s))

Nothing much to see here. Next.


Thread pool and multithreaded crawlers

import requests
import time
from concurrent.futures import ThreadPoolExecutor
import threading

List_Url = ['https://img.soutula.com/large/006APoFYly8hbhsbsmciuj30hs0hfaam.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhrnv8vx1j20hs0hj3zo.jpg',
            'https://img.soutula.com/large/006APoFYly8hbhm20mb6rj30hs0hmt9j.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhixn8cvvj205k0563yf.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhbzwzy4ej20hs0f1753.jpg']

def run(Url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36 Edg/91.0.864.53',
        'Referer': 'https://www.fabiaoqing.com/biaoqing'
    }
    response = requests.get(Url, headers=headers)
    with open("爬fabiaoqing网图/" + Url[-8:-4] + '.jpg', 'wb') as w:
        w.write(response.content)

if __name__ == '__main__':
    s = time.time()

    # Thread pool: map() distributes the URLs over the workers and
    # blocks until every download has finished
    with ThreadPoolExecutor(max_workers=5) as pool:
        # for Url in List_Url:
        #     pool.submit(run, Url)
        pool.map(run, List_Url)

    # Plain threads: one thread per URL, started then joined
    # threads = []
    # for url in List_Url:
    #     thread = threading.Thread(target=run, args=(url,))
    #     thread.start()
    #     threads.append(thread)
    # for t in threads:
    #     t.join()
    print("Total time {} s".format(time.time() - s))

Generally speaking, these two are what crawlers use most. The efficiency difference between them is small; it depends mostly on the server's response times at the moment.

What is worth noting is the difference in usage, sketched below.
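
As a minimal sketch of that difference (reusing the run function and List_Url from the script above): pool.map blocks and yields results in input order, while pool.submit returns Future objects that as_completed hands back in finishing order, and calling result() re-raises any exception from a worker.

from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=5) as pool:
    # One Future per download; iterate in completion order
    futures = [pool.submit(run, Url) for Url in List_Url]
    for future in as_completed(futures):
        future.result()  # re-raises exceptions from the worker, if any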


Coroutines

import asyncio
import time
import aiohttp

List_Url = ['https://img.soutula.com/large/006APoFYly8hbhsbsmciuj30hs0hfaam.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhrnv8vx1j20hs0hj3zo.jpg',
            'https://img.soutula.com/large/006APoFYly8hbhm20mb6rj30hs0hmt9j.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhixn8cvvj205k0563yf.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhbzwzy4ej20hs0f1753.jpg']

async def run(session, Url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36 Edg/91.0.864.53',
        'Referer': 'https://www.fabiaoqing.com/biaoqing'
    }
    # The await suspends this coroutine while the body downloads,
    # letting the event loop drive the other requests in the meantime
    async with session.get(Url, headers=headers) as response:
        with open("爬fabiaoqing网图/" + Url[-8:-4] + '.jpg', 'wb') as w:
            content = await response.content.read()
            w.write(content)

async def main():
    async with aiohttp.ClientSession() as session:
        # One task per URL, all sharing a single session
        tasks = [asyncio.create_task(run(session, Url)) for Url in List_Url]
        await asyncio.wait(tasks)

if __name__ == '__main__':
    s = time.time()
    # On Python 3.7+ these two lines can be replaced by asyncio.run(main())
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

    print("Total time {} s".format(time.time() - s))

If coroutines are unfamiliar, review the basics first and then come back. The concurrent coroutines here give a fairly clear speedup; when the network fluctuation is small, they are faster than multithreading.
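
A side note, not from the original post: asyncio.gather is a common alternative to asyncio.wait here. It accepts coroutines directly (no create_task needed), returns results in input order, and propagates the first exception instead of leaving it in a pending set. A minimal sketch, reusing the run coroutine above:

async def main():
    async with aiohttp.ClientSession() as session:
        # gather() schedules and awaits all downloads concurrently
        await asyncio.gather(*(run(session, Url) for Url in List_Url))

asyncio.run(main())  # Python 3.7+ entry point; creates and closes its own loop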


Thread pool + coroutine async

「A small aside: nothing here has been verified; it was just a quick test」 If anything is wrong, corrections are welcome.

import asyncio
import time
import aiohttp
from concurrent.futures import ThreadPoolExecutor

List_Url = ['https://img.soutula.com/large/006APoFYly8hbhsbsmciuj30hs0hfaam.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhrnv8vx1j20hs0hj3zo.jpg',
            'https://img.soutula.com/large/006APoFYly8hbhm20mb6rj30hs0hmt9j.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhixn8cvvj205k0563yf.jpg',
            'https://img.soutula.com/large/ceeb653ely8hbhbzwzy4ej20hs0f1753.jpg']

async def run(session, Url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36 Edg/91.0.864.53',
        'Referer': 'https://www.fabiaoqing.com/biaoqing'
    }
    async with session.get(Url, headers=headers) as response:
        with open("爬fabiaoqing网图/" + Url[-8:-4] + '.jpg', 'wb') as w:
            content = await response.content.read()
            w.write(content)

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(run(session, Url)) for Url in List_Url]
        await asyncio.wait(tasks)

if __name__ == '__main__':
    s = time.time()

    with ThreadPoolExecutor(max_workers=5) as pool:
        # Run the whole event loop inside a worker thread. Note that the
        # original pool.submit(loop.run_until_complete(main())) called
        # run_until_complete in the main thread and handed its return value
        # (None) to the pool, so it timed the same work as the plain
        # coroutine version plus pool overhead.
        future = pool.submit(asyncio.run, main())
        future.result()  # wait for completion and surface any exception

    print("Total time {} s".format(time.time() - s))

Summary

In the code above, please excuse the redundant usage!

Reprinted from blog.csdn.net/weixin_52040868/article/details/129258614