Concurrent Programming (11): Concurrent Programming with Python Asynchronous IO

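Concurrent Programming blog series

Concurrent Programming (1): Introduction to Concurrent Programming in Python

Concurrent Programming (2): How to Choose Between Multithreading, Multiprocessing, and Coroutines

Concurrent Programming (3): The Culprit Behind Slow Python: the Global Interpreter Lock (GIL)

Concurrent Programming (4): How to Use Multithreading: Converting a Crawler to Multithreading and Comparing the Results

Concurrent Programming (5): A Multithreaded Producer-Consumer Crawler in Python

Concurrent Programming (6): Thread Safety Problems and the Lock Solution

Concurrent Programming (7): The Handy Thread Pool ThreadPoolExecutor

Concurrent Programming (8): Speeding Up a Web Service with a Thread Pool

Concurrent Programming (9): Speeding Up Programs with multiprocessing

Concurrent Programming (10): Speeding Up a Flask Service with a Process Pool

Concurrent Programming (11): Concurrent Programming with Python Asynchronous IO

Concurrent Programming (12): Launching Any Program with subprocess (Playing Music, Extracting Archives, Automatic Downloads, and More)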

 
 


 

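How asynchronous IO works

Execution flow (execution path) of a single-threaded crawler

[Figure: execution path of a single-threaded crawler]

As the figure below shows, when the first task starts waiting on IO, it does not block until the IO finishes, as it does in the figure above. Instead, execution switches to the second task. Once every task has reached its IO wait, the loop goes back to the beginning and resumes the tasks until all of them have finished.

[Figure: execution path of a single-threaded crawler using asynchronous IO]

This is a good place to mention "the one loop":

the one loop

One loop to rule them all,

One loop to find them,

One loop to bring them all,

And in its light let them all thrive.

In other words, while one task waits on IO, the single thread keeps running an endless super loop that picks up other tasks, so the thread's time is not wasted on waiting. It is this one loop that gives us IO multiplexing and real concurrent execution inside a single thread.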
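The switching behaviour described above can be observed without any network access. The following is a minimal sketch, assuming `asyncio.sleep` is an acceptable stand-in for an IO wait; the names `fake_io_task` and `run_demo` are made up for this illustration. Three tasks that each "wait on IO" for 0.2 seconds finish in roughly 0.2 seconds in total rather than 0.6, because the loop starts the next task as soon as the current one begins waiting.

```python
import asyncio
import time


async def fake_io_task(name, delay, log):
    # Record when the task starts, then yield to the event loop while "waiting on IO".
    log.append(f"{name} start")
    await asyncio.sleep(delay)  # stand-in for a network wait (assumption)
    log.append(f"{name} done")


async def run_demo():
    log = []
    start = time.perf_counter()
    # All three tasks run on one thread; the loop switches between them at each await.
    await asyncio.gather(
        fake_io_task("task1", 0.2, log),
        fake_io_task("task2", 0.2, log),
        fake_io_task("task3", 0.2, log),
    )
    return log, time.perf_counter() - start


log, elapsed = asyncio.run(run_demo())
print(log)
print(f"elapsed: {elapsed:.2f}s")
```

All three "start" entries appear in the log before any "done" entry, which is the same pattern the crawler output in this post shows.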

 
 

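Python's asynchronous IO library: asyncio
import asyncio

# Get the event loop
loop = asyncio.get_event_loop()

# Define a coroutine (get_url is a placeholder for any awaitable IO call)
async def myfunc(url):
    await get_url(url)

# Create the task list (urls is a placeholder for the list of links)
tasks = [loop.create_task(myfunc(url)) for url in urls]

# Run the task list until every task has completed
loop.run_until_complete(asyncio.wait(tasks))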
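Recent Python releases discourage calling `asyncio.get_event_loop()` when no loop is running, so the same template is often written with `asyncio.run()` (available since Python 3.7) and `asyncio.gather()`. The sketch below is one possible equivalent, not the form used in the rest of this post. `get_url` here is a hypothetical placeholder that only sleeps and returns a fake body, and the `example.com` URLs are made up.

```python
import asyncio


async def get_url(url):
    # Hypothetical placeholder: pretend to fetch `url` and return its body.
    await asyncio.sleep(0.05)
    return f"<html>{url}</html>"


async def myfunc(url):
    body = await get_url(url)
    return url, len(body)


async def main(urls):
    # gather() schedules every coroutine as a task and returns results in input order.
    return await asyncio.gather(*(myfunc(url) for url in urls))


urls = [f"https://example.com/page{n}" for n in range(1, 4)]
# asyncio.run() creates the event loop, runs main() to completion, and closes the loop.
results = asyncio.run(main(urls))
print(results)
```

Unlike `asyncio.wait()`, `gather()` hands back the return values directly, which is convenient when each coroutine produces a result.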

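Note:

  • To be used in asynchronous IO programming, the libraries you depend on must themselves support asynchronous IO.
  • For example, the requests library used in crawlers does not support async, so here it has to be replaced with aiohttp.

Code: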
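Why the library matters can be seen with a small timing experiment. This is a sketch under stated assumptions: `blocking_fetch` is a hypothetical stand-in for a blocking call such as `requests.get`, simulated with `time.sleep`. Called directly inside a coroutine it stalls the event loop, so three calls run one after another. When no async library is available, `asyncio.to_thread` (Python 3.9+) is one possible fallback that moves the blocking call to a worker thread.

```python
import asyncio
import time


def blocking_fetch(url):
    # Hypothetical stand-in for a blocking call such as requests.get(url).
    time.sleep(0.2)
    return f"body of {url}"


async def fetch_blocking_directly(url):
    # The blocking call never yields to the event loop, so nothing else can run.
    return blocking_fetch(url)


async def fetch_in_thread(url):
    # asyncio.to_thread runs the blocking call in a worker thread and awaits its result.
    return await asyncio.to_thread(blocking_fetch, url)


async def timed(coro_func, urls):
    start = time.perf_counter()
    await asyncio.gather(*(coro_func(u) for u in urls))
    return time.perf_counter() - start


urls = [f"https://example.com/{n}" for n in range(3)]
direct = asyncio.run(timed(fetch_blocking_directly, urls))
threaded = asyncio.run(timed(fetch_in_thread, urls))
print(f"blocking inside coroutines: {direct:.2f}s")
print(f"offloaded to threads:       {threaded:.2f}s")
```

The thread fallback brings back thread overhead, which is why a natively asynchronous library such as aiohttp is the better fit for the crawler.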

# -*- coding: utf-8 -*-
# @Time    : 2021-03-22 17:20:27
# @Author  : wlq
# @FileName: async_test.py
# @Email   :[email protected]
# Imports
import asyncio
import aiohttp
import time

# Links to crawl
urls = [
    f"https://w.cnblogs.com/#p{page}"
    for page in range(1, 51)
]

# Define the coroutine (the function executed inside the super loop)
async def async_craw(url):
    print("craw url:", url)
    # Create the session object
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            # Read the response body
            rst = await resp.text()
            print(url, len(rst))

# Get the super loop (the event loop)
loop = asyncio.get_event_loop()

# Build the task list
tasks = [loop.create_task(async_craw(url)) for url in urls]

start = time.time()
# Run until every task in the list has completed
loop.run_until_complete(asyncio.wait(tasks))
end = time.time()
print("use time:", end - start)


'''
output:
craw url: https://w.cnblogs.com/#p1
craw url: https://w.cnblogs.com/#p2
craw url: https://w.cnblogs.com/#p3
craw url: https://w.cnblogs.com/#p4
craw url: https://w.cnblogs.com/#p5
craw url: https://w.cnblogs.com/#p6
craw url: https://w.cnblogs.com/#p7
craw url: https://w.cnblogs.com/#p8
craw url: https://w.cnblogs.com/#p9
craw url: https://w.cnblogs.com/#p10
craw url: https://w.cnblogs.com/#p11
craw url: https://w.cnblogs.com/#p12
craw url: https://w.cnblogs.com/#p13
craw url: https://w.cnblogs.com/#p14
craw url: https://w.cnblogs.com/#p15
craw url: https://w.cnblogs.com/#p16
craw url: https://w.cnblogs.com/#p17
craw url: https://w.cnblogs.com/#p18
craw url: https://w.cnblogs.com/#p19
craw url: https://w.cnblogs.com/#p20
craw url: https://w.cnblogs.com/#p21
craw url: https://w.cnblogs.com/#p22
craw url: https://w.cnblogs.com/#p23
craw url: https://w.cnblogs.com/#p24
craw url: https://w.cnblogs.com/#p25
craw url: https://w.cnblogs.com/#p26
craw url: https://w.cnblogs.com/#p27
craw url: https://w.cnblogs.com/#p28
craw url: https://w.cnblogs.com/#p29
craw url: https://w.cnblogs.com/#p30
craw url: https://w.cnblogs.com/#p31
craw url: https://w.cnblogs.com/#p32
craw url: https://w.cnblogs.com/#p33
craw url: https://w.cnblogs.com/#p34
craw url: https://w.cnblogs.com/#p35
craw url: https://w.cnblogs.com/#p36
craw url: https://w.cnblogs.com/#p37
craw url: https://w.cnblogs.com/#p38
craw url: https://w.cnblogs.com/#p39
craw url: https://w.cnblogs.com/#p40
craw url: https://w.cnblogs.com/#p41
craw url: https://w.cnblogs.com/#p42
craw url: https://w.cnblogs.com/#p43
craw url: https://w.cnblogs.com/#p44
craw url: https://w.cnblogs.com/#p45
craw url: https://w.cnblogs.com/#p46
craw url: https://w.cnblogs.com/#p47
craw url: https://w.cnblogs.com/#p48
craw url: https://w.cnblogs.com/#p49
craw url: https://w.cnblogs.com/#p50
https://w.cnblogs.com/#p9 70107
https://w.cnblogs.com/#p16 70107
https://w.cnblogs.com/#p10 70107
https://w.cnblogs.com/#p2 70107
https://w.cnblogs.com/#p4 70107
https://w.cnblogs.com/#p18 70107
https://w.cnblogs.com/#p5 70107
https://w.cnblogs.com/#p3 70107
https://w.cnblogs.com/#p13 70107
https://w.cnblogs.com/#p32 70107
https://w.cnblogs.com/#p7 70107
https://w.cnblogs.com/#p12 70107
https://w.cnblogs.com/#p43 70107
https://w.cnblogs.com/#p21 70107
https://w.cnblogs.com/#p15 70107
https://w.cnblogs.com/#p8 70107
https://w.cnblogs.com/#p22 70107
https://w.cnblogs.com/#p26 70107
https://w.cnblogs.com/#p37 70107
https://w.cnblogs.com/#p17 70107
https://w.cnblogs.com/#p19 70107
https://w.cnblogs.com/#p28 70107
https://w.cnblogs.com/#p1 70107
https://w.cnblogs.com/#p14 70107
https://w.cnblogs.com/#p45 70107
https://w.cnblogs.com/#p42 70107
https://w.cnblogs.com/#p34 70107
https://w.cnblogs.com/#p6 70107
https://w.cnblogs.com/#p36 70107
https://w.cnblogs.com/#p31 70107
https://w.cnblogs.com/#p39 70107
https://w.cnblogs.com/#p24 70107
https://w.cnblogs.com/#p50 70107
https://w.cnblogs.com/#p40 70107
https://w.cnblogs.com/#p29 70107
https://w.cnblogs.com/#p30 70107
https://w.cnblogs.com/#p48 70107
https://w.cnblogs.com/#p46 70107
https://w.cnblogs.com/#p27 70107
https://w.cnblogs.com/#p38 70107
https://w.cnblogs.com/#p25 70107
https://w.cnblogs.com/#p44 70107
https://w.cnblogs.com/#p23 70107
https://w.cnblogs.com/#p49 70107
https://w.cnblogs.com/#p47 70107
https://w.cnblogs.com/#p11 70107
https://w.cnblogs.com/#p35 70107
https://w.cnblogs.com/#p20 70107
https://w.cnblogs.com/#p41 70107
https://w.cnblogs.com/#p33 70107
use time: 0.3919532299041748
'''

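[Figure: single-threaded and multithreaded crawl times from post 4 of this series]

As the figure above shows, post 4 of this series measured how long the single-threaded and multithreaded crawlers took to fetch the pages. Running the same crawl with single-threaded asynchronous IO turns out to be the fastest of the three, because there is no thread-switching overhead.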

 
 

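Using semaphores (Semaphore)
  • A semaphore is a synchronization object that maintains a counter between 0 and a maximum value.

  • How it is used and how it works

    • Each time a thread completes a wait on the semaphore object, the counter decreases by one (in asyncio the waiter is a coroutine and the wait is `acquire()`);
    • Each time a thread completes a release on the semaphore object, the counter increases by one;
    • When the counter is 0, a wait on the semaphore object cannot succeed until the semaphore returns to the signaled state;
    • A semaphore whose counter is greater than 0 is in the signaled state; when the counter equals 0 it is in the nonsignaled state.
  • Semaphore syntax: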

    # Option 1:
    sem = asyncio.Semaphore(10)

    # ... later
    async with sem:
        # work with shared resource
        ...


    # Option 2:
    sem = asyncio.Semaphore(10)

    # ... later
    await sem.acquire()
    try:
        # work with shared resource
        ...
    finally:
        sem.release()
    
  • Example (modified from the crawler code above)

    # -*- coding: utf-8 -*-
    # @Time    : 2021-03-22 17:20:27
    # @Author  : wlq
    # @FileName: async_test.py
    # @Email   :[email protected]
    # Imports
    import asyncio
    import aiohttp
    import time

    # Initialize the semaphore
    semaphore = asyncio.Semaphore(10)


    # Links to crawl
    urls = [
        f"https://w.cnblogs.com/#p{page}"
        for page in range(1, 51)
    ]

    # Define the coroutine (the function executed inside the super loop)
    async def async_craw(url):
        async with semaphore:
            print("craw url:", url)
            # Create the session object
            async with aiohttp.ClientSession() as session:
                async with session.get(url) as resp:
                    # Read the response body
                    rst = await resp.text()
                    await asyncio.sleep(5)
                    print(url, len(rst))

    # Get the super loop (the event loop)
    loop = asyncio.get_event_loop()

    # Build the task list
    tasks = [loop.create_task(async_craw(url)) for url in urls]

    start = time.time()
    # Run until every task in the list has completed
    loop.run_until_complete(asyncio.wait(tasks))
    end = time.time()
    print("use time:", end - start)
    
    
    '''
    output:
    craw url: https://w.cnblogs.com/#p1
    craw url: https://w.cnblogs.com/#p2
    craw url: https://w.cnblogs.com/#p3
    craw url: https://w.cnblogs.com/#p4
    craw url: https://w.cnblogs.com/#p5
    craw url: https://w.cnblogs.com/#p6
    craw url: https://w.cnblogs.com/#p7
    craw url: https://w.cnblogs.com/#p8
    craw url: https://w.cnblogs.com/#p9
    craw url: https://w.cnblogs.com/#p10
    https://w.cnblogs.com/#p4 70111
    https://w.cnblogs.com/#p8 70111
    https://w.cnblogs.com/#p6 70111
    https://w.cnblogs.com/#p10 70111
    https://w.cnblogs.com/#p3 70111
    https://w.cnblogs.com/#p2 70111
    https://w.cnblogs.com/#p5 70111
    https://w.cnblogs.com/#p1 70111
    https://w.cnblogs.com/#p7 70111
    craw url: https://w.cnblogs.com/#p11
    craw url: https://w.cnblogs.com/#p12
    craw url: https://w.cnblogs.com/#p13
    craw url: https://w.cnblogs.com/#p14
    craw url: https://w.cnblogs.com/#p15
    craw url: https://w.cnblogs.com/#p16
    craw url: https://w.cnblogs.com/#p17
    craw url: https://w.cnblogs.com/#p18
    craw url: https://w.cnblogs.com/#p19
    https://w.cnblogs.com/#p9 70111
    craw url: https://w.cnblogs.com/#p20
    https://w.cnblogs.com/#p13 70111
    https://w.cnblogs.com/#p15 70111
    craw url: https://w.cnblogs.com/#p21
    craw url: https://w.cnblogs.com/#p22
    https://w.cnblogs.com/#p18 70111
    https://w.cnblogs.com/#p16 70111
    https://w.cnblogs.com/#p19 70111
    https://w.cnblogs.com/#p17 70111
    https://w.cnblogs.com/#p11 70111
    https://w.cnblogs.com/#p14 70111
    craw url: https://w.cnblogs.com/#p23
    craw url: https://w.cnblogs.com/#p24
    craw url: https://w.cnblogs.com/#p25
    craw url: https://w.cnblogs.com/#p26
    craw url: https://w.cnblogs.com/#p27
    craw url: https://w.cnblogs.com/#p28
    https://w.cnblogs.com/#p12 70111
    craw url: https://w.cnblogs.com/#p29
    https://w.cnblogs.com/#p20 70111
    craw url: https://w.cnblogs.com/#p30
    https://w.cnblogs.com/#p22 70111
    craw url: https://w.cnblogs.com/#p31
    https://w.cnblogs.com/#p27 70111
    craw url: https://w.cnblogs.com/#p32
    https://w.cnblogs.com/#p21 70111
    https://w.cnblogs.com/#p29 70111
    https://w.cnblogs.com/#p25 70111
    https://w.cnblogs.com/#p24 70111
    https://w.cnblogs.com/#p26 70111
    https://w.cnblogs.com/#p28 70111
    https://w.cnblogs.com/#p23 70111
    craw url: https://w.cnblogs.com/#p33
    craw url: https://w.cnblogs.com/#p34
    craw url: https://w.cnblogs.com/#p35
    craw url: https://w.cnblogs.com/#p36
    craw url: https://w.cnblogs.com/#p37
    craw url: https://w.cnblogs.com/#p38
    craw url: https://w.cnblogs.com/#p39
    https://w.cnblogs.com/#p30 70111
    craw url: https://w.cnblogs.com/#p40
    https://w.cnblogs.com/#p31 70111
    craw url: https://w.cnblogs.com/#p41
    https://w.cnblogs.com/#p32 70111
    https://w.cnblogs.com/#p38 70111
    craw url: https://w.cnblogs.com/#p42
    craw url: https://w.cnblogs.com/#p43
    https://w.cnblogs.com/#p37 70111
    https://w.cnblogs.com/#p34 70111
    https://w.cnblogs.com/#p35 70111
    https://w.cnblogs.com/#p36 70111
    https://w.cnblogs.com/#p39 70111
    https://w.cnblogs.com/#p33 70111
    craw url: https://w.cnblogs.com/#p44
    craw url: https://w.cnblogs.com/#p45
    craw url: https://w.cnblogs.com/#p46
    craw url: https://w.cnblogs.com/#p47
    craw url: https://w.cnblogs.com/#p48
    craw url: https://w.cnblogs.com/#p49
    https://w.cnblogs.com/#p40 70111
    craw url: https://w.cnblogs.com/#p50
    https://w.cnblogs.com/#p41 70111
    https://w.cnblogs.com/#p42 70111
    https://w.cnblogs.com/#p43 70111
    https://w.cnblogs.com/#p44 70111
    https://w.cnblogs.com/#p45 70111
    https://w.cnblogs.com/#p48 70111
    https://w.cnblogs.com/#p49 70111
    https://w.cnblogs.com/#p50 70111
    https://w.cnblogs.com/#p46 70111
    https://w.cnblogs.com/#p47 70111
    use time: 26.094701528549194
    '''
    
    
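    Because the code adds a 5-second wait and the semaphore allows at most 10 holders at a time, only the first ten URLs are fetched and waited on at the start; each later URL starts only when a slot is released. The 50 URLs therefore run in about five rounds of 5 seconds, which matches the measured total of roughly 26 seconds.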
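The batching effect can be checked offline with a scaled-down version of the example. This is a minimal sketch, assuming `asyncio.sleep` stands in for the fetch plus the 5-second wait; `LIMIT`, `TASKS`, `WAIT`, `worker`, and the `state` dictionary are names invented for this illustration. With 6 tasks, a limit of 2, and a 0.1-second wait, the tasks run in about three rounds, and the number of tasks inside the semaphore never exceeds the limit.

```python
import asyncio
import time

LIMIT = 2   # maximum number of tasks allowed inside the semaphore at once
TASKS = 6   # total number of tasks to run
WAIT = 0.1  # simulated IO wait per task, in seconds


async def worker(sem, state):
    async with sem:  # acquire on entry, release on exit
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
        await asyncio.sleep(WAIT)  # stand-in for the real request (assumption)
        state["running"] -= 1


async def main():
    sem = asyncio.Semaphore(LIMIT)
    state = {"running": 0, "peak": 0}
    start = time.perf_counter()
    await asyncio.gather(*(worker(sem, state) for _ in range(TASKS)))
    return state["peak"], time.perf_counter() - start


peak, elapsed = asyncio.run(main())
print("peak concurrency:", peak)
print(f"elapsed: {elapsed:.2f}s")
```

The elapsed time is close to TASKS / LIMIT × WAIT, the same arithmetic as 50 / 10 × 5 seconds in the crawler example.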
    
    


Reposted from blog.csdn.net/qq_42546127/article/details/115183616