Python mini-crawler: a quick start with coroutine crawlers

foreword

Crawlers are handy things, and I've been needing them again recently, so I'll take the chance to share some of the little things I've built and write a few posts about them~

coroutine

First of all, let's be clear: a coroutine is not multi-threading. It is essentially a single thread, but when the current task enters an IO wait, execution automatically switches to another task, which improves the overall efficiency of the program. In that sense it works much like the multitasking mechanism of the operating system. The effect looks a bit like multi-threading or a thread pool, but coroutines are more lightweight: essentially one thread switching back and forth between tasks.

Getting started with coroutines

Let's first get a feel for what coroutines do.
To use coroutines in Python, i.e. to write asynchronous code, we need to master two keywords, async and await, plus the library that supports coroutines: asyncio.
Let's look at the code first.

import asyncio
import time

# coroutine function
async def do_some_work(x):
    print('doing: ', x)
    await asyncio.sleep(2)   # non-blocking sleep, standing in for an IO operation
    return 'done {}'.format(x)

xs = [1, 2, 3]
# turn each coroutine object into a task and collect them in a list
tasks = []
start = time.time()
for x in xs:
    c = do_some_work(x)      # calling the coroutine function gives a coroutine object
    tasks.append(asyncio.ensure_future(c))

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

print(time.time() - start)

[Output screenshot: all three tasks finish in roughly 2 seconds]

At first glance it looks as if we have implemented multi-threading, so let me change the code.
[Screenshots: the modified code and its output, which now takes about 6 seconds]
See that? 6 seconds. How can that be? If this were real multi-threading it would still take 2 seconds. So coroutines are not multi-threading; that is the first point.
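The screenshots are missing here, so below is a minimal sketch of what the modified code most likely looked like (my reconstruction, not the original image): the only change is swapping the awaitable asyncio.sleep(2) for the blocking time.sleep(2), which stops the event loop from switching tasks, so the three tasks run one after another and take about 6 seconds.

import asyncio
import time

# same program as before, but with a blocking sleep instead of an awaitable one
async def do_some_work(x):
    print('doing: ', x)
    time.sleep(2)            # blocking call: the event loop cannot switch away here
    return 'done {}'.format(x)

xs = [1, 2, 3]
tasks = []
start = time.time()
for x in xs:
    tasks.append(asyncio.ensure_future(do_some_work(x)))

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

print(time.time() - start)   # roughly 6 seconds: the tasks effectively run serially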

Well, since it is not multi-threading, why did the original version take only two seconds?

Coroutines run asynchronously

This is actually simple, but first we have to talk about one call:

asyncio.sleep(2)

What is special about it? The special point is that this sleep behaves like an IO operation: it is awaitable, so while it "sleeps" the event loop is free to run other tasks. In other words, this sleep stands in for an IO operation.

We started three asynchronous tasks here, and all of their IO waits were handed over to the event loop at the same time, so in the end we only spent about 2 seconds.

How it works

Well, now that we've seen the result, it's time to talk about why.
First, the keyword async declares that a function is asynchronous: it turns the function into a coroutine function, and calling it produces a coroutine object instead of running the body.
As for await, this is the "secret" behind the two seconds. When the thing it modifies is a time-consuming IO operation, it hands that operation over to the event loop (and ultimately the operating system) and lets the CPU switch to another task. In a single thread, there can be many tasks.
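To make the two keywords concrete, here is a tiny sketch of my own (not from the original post): calling a coroutine function only creates a coroutine object, and await is what actually runs it and gives up control while it waits.

import asyncio

async def fetch_number():
    await asyncio.sleep(1)     # pretend IO: control returns to the event loop here
    return 42

async def main():
    coro = fetch_number()      # just a coroutine object, nothing has run yet
    print(type(coro))          # <class 'coroutine'>
    result = await coro        # await drives it and suspends main() during the wait
    print(result)              # 42

loop = asyncio.get_event_loop()
loop.run_until_complete(main())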

task management

We've covered those two keywords. The question now is: who actually tells the operating system what to do and schedules the work for me? This is where asyncio comes in.

That's right, the lines at the end of the example above: asyncio.ensure_future(c) wraps each coroutine into a task, and loop.run_until_complete(asyncio.wait(tasks)) hands the whole batch to the event loop to run.

Of course, there are several ways to create and run coroutines (three, in fact), but this one is used most often in crawlers, so it is the only one I write out here. (It is somewhat similar to FutureTask in Java.)
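For completeness, here is a sketch of another common way to run the same tasks, using asyncio.run together with asyncio.gather (my addition, not the pattern the rest of this post builds on):

import asyncio
import time

async def do_some_work(x):
    print('doing: ', x)
    await asyncio.sleep(2)
    return 'done {}'.format(x)

async def main():
    # gather schedules the coroutines concurrently and returns their results in order
    return await asyncio.gather(*(do_some_work(x) for x in [1, 2, 3]))

start = time.time()
results = asyncio.run(main())   # asyncio.run creates and closes the event loop for us
print(results)                  # ['done 1', 'done 2', 'done 3']
print(time.time() - start)      # still about 2 seconds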

aiohttp

Now it is time for our asynchronous crawling. Requesting a page, i.e. fetching a resource over the network, is really just an IO operation, so we can wrap it with async. But at this point we can no longer use requests, because it is blocking.

You have to use aiohttp instead; install it first:

pip install aiohttp

import asyncio
import time
import aiohttp

# hit the Bing home page three times
urls = ["https://cn.bing.com/?FORM=Z9FD1",
        "https://cn.bing.com/?FORM=Z9FD1",
        "https://cn.bing.com/?FORM=Z9FD1"]

async def get_page(url):
    print("start crawling", url)
    # async block: awaiting the async calls is what allows switching,
    # otherwise the requests would just run serially
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            page = await resp.text()
    print("finished crawling ->", url)
    return page

tasks = []
start = time.time()
for url in urls:
    c = get_page(url)
    tasks.append(asyncio.ensure_future(c))

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

print(time.time() - start)

So what about this aiohttp thing? How to put it: many of its methods mirror requests. For example, the aiohttp.ClientSession() used just now plays the same role as requests.Session() (they are highly similar).
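For comparison, here is a minimal sketch of the same fetch written with the blocking requests library (my addition, just to show how close the two APIs look):

import requests

def get_page_blocking(url):
    # same structure as the aiohttp version, but every call blocks the thread
    with requests.Session() as session:
        resp = session.get(url)
        return resp.text

print(len(get_page_blocking("https://cn.bing.com/?FORM=Z9FD1")))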

Asynchronous save

Having said this, naturally there is also aiofiles, an asynchronous file library.

import asyncio
import time
import aiohttp
import aiofiles

# hit the Bing home page three times
urls = ["https://cn.bing.com/?FORM=Z9FD1",
        "https://cn.bing.com/?FORM=Z9FD1",
        "https://cn.bing.com/?FORM=Z9FD1"]

async def get_page(url):
    print("start crawling", url)
    # async block: awaiting the async calls is what allows switching,
    # otherwise the requests would just run serially
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            page = await resp.text()
    print("finished crawling ->", url)
    # asynchronous write with aiofiles:
    # async with aiofiles.open("a.html", 'w', encoding='utf-8') as f:
    #     await f.write(page)
    #     await f.flush()
    # ordinary blocking write (the with block closes the file for us):
    with open("a.html", 'w', encoding='utf-8') as f:
        f.write(page)
        f.flush()
    return page

tasks = []
start = time.time()
for url in urls:
    c = get_page(url)
    tasks.append(asyncio.ensure_future(c))

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

print(time.time() - start)

Let me first point out that asynchronous file writing is not necessarily better than plain blocking writes, and some third-party libraries do not support it!
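If you do want the aiofiles version enabled, a minimal sketch (my addition, mirroring the commented-out lines above) looks like this:

import asyncio
import aiofiles

async def save_page(filename, page):
    # the async with block opens, writes and closes the file without blocking the event loop
    async with aiofiles.open(filename, 'w', encoding='utf-8') as f:
        await f.write(page)

asyncio.run(save_page("a.html", "<html>demo</html>"))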

Asynchronous callback

We can now save files asynchronously. The next problem is that I want to get the result of a task directly and then parse it. For that we attach an asynchronous callback.

import asyncio
import time
import aiohttp

# hit the Bing home page three times
urls = ["https://cn.bing.com/?FORM=Z9FD1",
        "https://cn.bing.com/?FORM=Z9FD1",
        "https://cn.bing.com/?FORM=Z9FD1"]

async def get_page(url):
    print("start crawling", url)
    # async block: awaiting the async calls is what allows switching,
    # otherwise the requests would just run serially
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            page = await resp.text()
    print("finished crawling ->", url)
    return page

def parse(task):
    page = task.result()   # get the return value of the finished task
    print(len(page))

tasks = []
start = time.time()
for url in urls:
    c = get_page(url)
    task = asyncio.ensure_future(c)
    task.add_done_callback(parse)   # parse() is called with the task once it finishes
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

print(time.time() - start)

Well, that's the asynchronous callback. The reason I use a future/task here is that it lets me get the return value back; in Java the analogue is Callable.
Next up there is a god-level tool, Scrapy, which I will cover in a later post (depending on how things go, probably Friday!)
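If you would rather avoid callbacks entirely, an alternative (my addition, not from the original post) is to let asyncio.gather hand the results back and parse them afterwards:

import asyncio
import aiohttp

urls = ["https://cn.bing.com/?FORM=Z9FD1"] * 3

async def get_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()

async def main():
    # gather returns the pages in the same order the coroutines were passed in
    pages = await asyncio.gather(*(get_page(url) for url in urls))
    for page in pages:
        print(len(page))   # parse each page here instead of in a callback

asyncio.run(main())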

Origin blog.csdn.net/FUTEROX/article/details/123284841