Using coroutines to build a high-performance web crawler

I. Introduction

  When performing IO-intensive tasks, a program often blocks while waiting for IO. Take a web crawler as an example: if we use the requests library to make requests and the target site responds too slowly, the program just sits waiting for the response, so its crawling efficiency is very, very low. To solve this problem, this article explores Python's asynchronous coroutines to speed up the process; the method is very effective for IO-intensive tasks. Applied to a web crawler, it can multiply crawling efficiency several times over. The article uses async / await, so Python 3.5 or later is required.

II. Concepts

1. Blocking

  A program is in a suspended state when the computing resources it needs are not available. If a program cannot continue to do other things while it waits for some operation to complete, we say the program is blocked on that operation. Common forms of blocking: network I/O blocking, disk I/O blocking, user-input blocking, and so on. This even includes CPU context switches: on a multi-core CPU, the core performing the context-switch operation cannot be used for other work.

2. Non-blocking

  If a program is not itself suspended while waiting for some operation and can continue to do other things, we say the program is non-blocking on that operation. Non-blocking exists only because blocking exists: it is precisely because an operation blocks, costing time and efficiency, that we turn it into a non-blocking one to improve efficiency.
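
  As an illustrative sketch (not part of the original article), the difference is easy to see with a socket, whose mode can be switched explicitly:

```python
import socket

# a pair of connected sockets, just for a local demonstration
a, b = socket.socketpair()
b.setblocking(False)  # switch the receiving end to non-blocking mode

# Nothing has been sent yet. In blocking mode recv() would wait here
# indefinitely; in non-blocking mode it returns control immediately by
# raising BlockingIOError, so the program can go do other things.
try:
    data = b.recv(1024)
    got_data = True
except BlockingIOError:
    got_data = False

a.close()
b.close()
print(got_data)  # False: the call returned immediately instead of blocking
```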

3. Synchronization

  When different program units must rely on some means of communication to coordinate during execution in order to complete a task, we say those units execute synchronously. For example, updating product inventory in a shopping system needs a "lock" as a communication signal, so that different update requests are queued and forced to execute in order; the inventory-update operation is then synchronous. Synchronization means order.
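
  To make the inventory example concrete, here is a minimal sketch (the names are invented for illustration) that uses a threading lock as the coordinating signal:

```python
import threading

inventory = {'count': 100}
lock = threading.Lock()

def handle_order(n):
    # the lock queues up concurrent update requests, forcing them to
    # execute one at a time -- the updates are synchronized
    with lock:
        inventory['count'] -= n

threads = [threading.Thread(target=handle_order, args=(1,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(inventory['count'])  # 0: all 100 orders applied exactly once
```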

4. Asynchronous

  If program units can complete a task without communicating to coordinate with one another, unrelated units can be asynchronous. For example, a crawler downloads a page; once the download finishes, the program calls the scheduler to schedule other tasks, with no need to keep communicating with the download task to coordinate its behavior. The download, save, and other operations for different pages are all independent and need no mutual notification or coordination. When these asynchronous operations complete is not determined; asynchrony means disorder.

5. Coroutine

  A coroutine, also called a micro-thread or fiber, is a user-level lightweight thread. A coroutine has its own register context and stack. When the coroutine is scheduled away, it saves its register context and stack elsewhere; when it is switched back in, it restores the previously saved register context and stack. A coroutine can therefore preserve the state of its last call, i.e., a particular combination of all local state; each re-entry is equivalent to returning to the state of the previous call. A coroutine is essentially single-process: compared with multiple processes or threads, coroutines have no thread context-switching overhead and no locking or atomic-operation overhead for synchronization, and the programming model is very simple.

  In the web-crawler scenario, after we issue a request we have to wait some time for the response, but during that wait the program can do many other things and switch back to continue processing once the response arrives. This makes full use of the CPU and other resources, and that is the advantage of asynchronous coroutines.

6. The asynchronous coroutine library asyncio

  The most commonly used coroutine library in Python is asyncio.

  • event_loop: the event loop, equivalent to an infinite loop. We can register functions to this event loop, and when a condition occurs the corresponding handler is called.
  • coroutine: in Python this often refers to the coroutine object type. We can register a coroutine object on the event loop, and the event loop will call it. We can use the async keyword to define a method; calling that method does not execute it immediately but instead returns a coroutine object.
  • task: a task, a further wrapper around a coroutine object that also holds the task's status.
  • future: a placeholder for the result of a task that will run or has not yet run; there is no essential difference between it and a task.

  In addition, we need to know the async / await keywords, which were introduced in Python 3.5 specifically for defining coroutines. async defines a coroutine, and await is used to suspend execution on a blocking method.
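
  A minimal sketch tying these pieces together (the function names are invented for illustration; asyncio.run, used here for brevity, requires Python 3.7+, while the loop-based style shown later in this article works from 3.5):

```python
import asyncio


async def hello():  # async defines a coroutine function
    return 'hello'


coro = hello()  # calling it does NOT run it; it returns a coroutine object
print(type(coro).__name__)  # coroutine


async def main():
    task = asyncio.ensure_future(coro)  # wrap the coroutine object in a task
    return await task  # await suspends main() until the task completes


result = asyncio.run(main())
print(result)  # hello
```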

III. Code implementation

1. First, implement a simple server with Flask

  If you have not installed Flask, you can install it with the following command:

pip3 install flask

  Then write the server code as follows:

from flask import Flask
import time

app = Flask(__name__)


@app.route('/')
def index():
    # simulate a time-consuming IO operation
    time.sleep(2)
    return 'hello'


if __name__ == '__main__':
    # start in multi-threaded mode
    app.run(threaded=True)

  Here we define a Flask service whose main entry is the index() method. The method first calls sleep() to sleep for 2 seconds and then returns the result, so every request to this interface takes at least 2 seconds; we have thus simulated a slow service interface.

2. Testing with asyncio

import asyncio
import requests
import time

start = time.time()


async def get(url):
    return requests.get(url)


async def request():
    url = 'http://127.0.0.1:5000'
    print('Waiting for ', url)
    response = await get(url)
    print('Get response from ', url, 'Result:', response.text)


tasks = [asyncio.ensure_future(request()) for _ in range(5)]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print(
    'Cost time:', end - start
)

  Here we again create five tasks, pass the task list to the wait() method, and register it on the event loop to execute.

  The output is:

Waiting for  http://127.0.0.1:5000
Get response from  http://127.0.0.1:5000 Result: hello
Waiting for  http://127.0.0.1:5000
Get response from  http://127.0.0.1:5000 Result: hello
Waiting for  http://127.0.0.1:5000
Get response from  http://127.0.0.1:5000 Result: hello
Waiting for  http://127.0.0.1:5000
Get response from  http://127.0.0.1:5000 Result: hello
Waiting for  http://127.0.0.1:5000
Get response from  http://127.0.0.1:5000 Result: hello
Cost time: 10.043976068496704

  You can see this is no different from plain sequential requests: they still execute one after another, taking 10 seconds, an average of 2 seconds per request. So where is the promised asynchronous handling? In fact, to achieve asynchrony we first need a suspend operation: when a task has to wait for an IO result, it can suspend itself and switch to other tasks, and only then can we make full use of resources. The code above runs strictly serially, with no suspension at all, so asynchrony is impossible. To achieve it, let's look at how await is used: await suspends a time-consuming waiting operation and yields control. When a running coroutine encounters an await, the event loop suspends that coroutine and switches to another, until the others are suspended or finished as well.
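
  To see that suspension is what buys concurrency, here is a minimal sketch (not from the original article) that swaps the blocking requests.get call for asyncio.sleep, a stand-in for a genuinely awaitable IO operation; five 1-second waits then finish in about 1 second rather than 5 (asyncio.run requires Python 3.7+):

```python
import asyncio
import time


async def fake_get(url):
    # asyncio.sleep is awaitable: the event loop suspends this coroutine
    # here and is free to run the other tasks in the meantime
    await asyncio.sleep(1)
    return 'hello'


async def main():
    tasks = [asyncio.ensure_future(fake_get('http://127.0.0.1:5000'))
             for _ in range(5)]
    await asyncio.wait(tasks)


start = time.time()
asyncio.run(main())
elapsed = time.time() - start
print('Cost time:', elapsed)  # about 1 second for five 1-second waits
```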

  Merely wrapping IO-bound code in an async-decorated method is not enough! We must use a request method that itself supports asynchronous operation to achieve real asynchrony, and that is where aiohttp comes in.

3. Using aiohttp

  aiohttp is a library that supports asynchronous requests; together with asyncio, it makes asynchronous request operations very convenient.

  Install it as follows:

pip3 install aiohttp

  The official documentation is at https://aiohttp.readthedocs.io/. It has two parts, a Client and a Server; see the official documentation for details.

  

import aiohttp
import asyncio
import time

start = time.time()


async def get(url):
    session = aiohttp.ClientSession()         # instantiate a ClientSession object
    response = await session.get(url)         # get(), post(), params/data, proxy='...' etc. are supported
    result = await response.text()            # text() for a string, json() for JSON, read() for bytes
    await session.close()                     # release resources; a with statement can do this automatically
    return result




async def request():
    url = 'http://127.0.0.1:5000'
    print('Waiting for ', url)
    result = await get(url)
    print('Get response from ', url, 'Result:', result)


tasks = [asyncio.ensure_future(request()) for _ in range(5)]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print('Cost time:', end - start)

  The output is as follows:

Waiting for  http://127.0.0.1:5000
Waiting for  http://127.0.0.1:5000
Waiting for  http://127.0.0.1:5000
Waiting for  http://127.0.0.1:5000
Waiting for  http://127.0.0.1:5000
Get response from  http://127.0.0.1:5000 Result: hello
Get response from  http://127.0.0.1:5000 Result: hello
Get response from  http://127.0.0.1:5000 Result: hello
Get response from  http://127.0.0.1:5000 Result: hello
Get response from  http://127.0.0.1:5000 Result: hello
Cost time: 2.012542963027954

  When this runs, the event loop runs the first task. For that task, execution reaches the first await and the awaited get() method is suspended, but get()'s first step is non-blocking, so the task is woken immediately and resumes: it creates the ClientSession object, then hits the second await, session.get(), and is suspended there. Because the request takes a long time, it is not woken for a while. So the first task is suspended; what now? The event loop looks for coroutines that are not currently suspended and moves on to the second task, which goes through exactly the same steps, and so on, until the fifth task's session.get() has run and all five tasks are suspended. With everything suspended, there is nothing to do but wait. After 2 seconds the responses arrive almost simultaneously, the tasks are woken to continue executing, and the results are printed. Total cost: 2 seconds!

  The code above can also be written with with statements:

# using with statements
async def get_w(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            result = await response.text()
            return result

4. Combining with multiprocessing: aiomultiprocess (Python 3.6+)

  Install it with:

pip3 install aiomultiprocess

import asyncio
import aiohttp
import time
from aiomultiprocess import Pool

start = time.time()

async def get(url):
    session = aiohttp.ClientSession()
    response = await session.get(url)
    result = await response.text()
    await session.close()
    return result

async def request():
    url = 'http://127.0.0.1:5000'
    urls = [url for _ in range(100)]
    async with Pool() as pool:
        result = await pool.map(get, urls)
        return result

coroutine = request()
task = asyncio.ensure_future(coroutine)
loop = asyncio.get_event_loop()
loop.run_until_complete(task)

end = time.time()
print('Cost time:', end - start)

  Of course, the final time cost turns out to be about the same as with plain async.

  Crawling scenarios vary enormously. On one hand we use asynchronous coroutines to avoid blocking; on the other we use multiprocessing to exploit multiple cores for a multiplied speedup. The time saved can be quite substantial.
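
  As a rough, illustrative sketch of that combination using only the standard library (no aiomultiprocess; asyncio.sleep stands in for a real aiohttp request, and all names are invented), each worker process can run its own event loop over a slice of the URLs:

```python
import asyncio
import time
from concurrent.futures import ProcessPoolExecutor


async def fake_fetch(url):
    await asyncio.sleep(0.1)  # stand-in for an awaitable aiohttp request
    return 'hello'


async def run_all(urls):
    # run one batch of requests concurrently inside this process
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def crawl_batch(urls):
    # each worker process starts its own event loop (asyncio.run, 3.7+)
    return asyncio.run(run_all(urls))


if __name__ == '__main__':
    start = time.time()
    urls = ['http://127.0.0.1:5000'] * 40
    batches = [urls[i::4] for i in range(4)]  # 4 slices of 10 URLs each
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = [r for batch in pool.map(crawl_batch, batches) for r in batch]
    print(len(results), 'results in', time.time() - start, 'seconds')
```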

 

Origin www.cnblogs.com/zivli/p/11657116.html