Python crawler thread

Disclaimer: This article is for reference only and may not be reproduced or copied. If a reader of this article violates any national laws or regulations, all consequences are borne by that reader and have nothing to do with the author. Likewise, any disputes or consequences arising from a reader's reprinting or copying of this article are borne by that reader and have nothing to do with the author.

1. Basic knowledge

When requests block during crawling, a high-performance asynchronous crawler can be used to fetch data concurrently.

Approaches to asynchronous crawling:

  1. Multi-threading / multi-processing.
    1.1 Benefit: a separate thread or process can be opened for each blocking operation, achieving asynchronous execution.
    1.2 Drawback: threads and processes cannot be created without limit; opening too many incurs heavy system overhead and can actually slow crawling down.
  2. Thread pool / process pool. (Use appropriately.)
    2.1 Benefit: reduces how often the system creates and destroys threads or processes, lowering system overhead.
    2.2 Drawback: the pool caps the number of threads or processes.
  3. Single thread + asynchronous coroutines. (Recommended.)

Coroutine-related concepts:

  1. event_loop: the event loop is effectively an infinite loop. Functions can be registered with it, and when their conditions are met the loop executes them.
  2. coroutine: a coroutine object can be registered with the event loop, which calls it when scheduled. A method defined with the async keyword is not executed when called; the call returns a coroutine object instead.
  3. task: a further wrapper around a coroutine object that also tracks the task's state.
  4. future: represents a task that will run in the future or has not yet run; in practice there is no essential difference from a task.
  5. await: suspends execution at a blocking operation, yielding control back to the event loop.
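The concepts above can be illustrated with a minimal sketch (fetch and the URL are made up for illustration): calling an async function only produces a coroutine object; the event loop is what actually drives it.

```python
import asyncio
import inspect


async def fetch(url):
    # await suspends here and yields control back to the event loop
    await asyncio.sleep(0)
    return "done: " + url


# calling an async function does NOT run it; it returns a coroutine object
coro = fetch("www.example.com")
is_coro = inspect.iscoroutine(coro)
coro.close()  # discard the unused coroutine object to avoid a warning

# only the event loop actually drives a coroutine to completion
result = asyncio.run(fetch("www.example.com"))
```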

2. Basic use of threads

First, the single-threaded serial version, without a thread pool.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

# imports
import time


# send a request (simulated with sleep)
def get_page(page):
    print("Start loading: %d" % page)
    time.sleep(2)
    print("Finished loading: %d" % page)


if __name__ == '__main__':
    # pages
    pages = [1, 2, 3, 4]

    # start time
    start_time = time.time()

    # load each page serially
    for p in pages:
        get_page(p)

    # end time
    end_time = time.time()

    # total time
    print("Total time: %d s" % (end_time - start_time))


Execution using a thread pool.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

# imports
from multiprocessing.dummy import Pool
import time


# send a request (simulated with sleep)
def pool_page(page):
    print("Start loading: %d" % page)
    time.sleep(2)
    print("Finished loading: %d" % page)


if __name__ == '__main__':

    # number of pages
    pages = 4

    # start timing
    start_time = time.time()

    # instantiate a pool of 4 worker threads
    pool = Pool(4)
    # map passes each element of the iterable to pool_page(page)
    pool_map = pool.map(pool_page, range(1, pages + 1))
    # close the pool; no more tasks can be submitted
    pool.close()

    # total time
    print("Total time: %d s" % (time.time() - start_time))
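The standard library's concurrent.futures module offers the same thread-pool pattern with a more modern interface. A minimal sketch, where load_page simulates blocking I/O with a short sleep and its return value stands in for fetched content:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_page(page):
    time.sleep(0.2)     # simulated blocking I/O
    return page * page  # stand-in for fetched content


start = time.time()
# map distributes the pages across 4 worker threads concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(load_page, [1, 2, 3, 4]))
elapsed = time.time() - start
```

Because the four 0.2 s sleeps overlap, the total wall time stays close to 0.2 s instead of 0.8 s.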


3. Basic use of coroutines

Basic use

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import asyncio


# function defined with async
async def request_url(url):
    print("Requesting " + url)
    print("Successfully requested " + url)


if __name__ == '__main__':
    # get a coroutine object
    coroutine = request_url("www.baidu.com")

    # create an event loop
    loop = asyncio.get_event_loop()

    # register the coroutine object with the loop, then start the loop
    loop.run_until_complete(coroutine)
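Since Python 3.7, asyncio.run wraps the loop creation, execution, and teardown shown above in a single call (and asyncio.get_event_loop is deprecated outside a running loop in newer versions). A minimal equivalent sketch:

```python
import asyncio


async def request_url(url):
    # placeholder body; a real crawler would await an HTTP request here
    return "requested " + url


# asyncio.run creates the event loop, runs the coroutine, and closes the loop
result = asyncio.run(request_url("www.baidu.com"))
print(result)
```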


Basic use of tasks.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import asyncio


# function defined with async
async def request_url(url):
    print("Requesting " + url)
    print("Successfully requested " + url)


if __name__ == '__main__':
    # get a coroutine object
    coroutine = request_url("www.baidu.com")

    # create an event loop
    loop = asyncio.get_event_loop()

    # create a task object
    task = loop.create_task(coroutine)
    print(task)

    # register the task with the loop, then start the loop
    loop.run_until_complete(task)
    print(task)


Basic use of futures. The difference from the task example above lies only in how the object is created: a task is created by the event loop via loop.create_task, while asyncio.ensure_future does not need a reference to the loop.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import asyncio


# function defined with async
async def request_url(url):
    print("Requesting " + url)
    print("Successfully requested " + url)


if __name__ == '__main__':
    # get a coroutine object
    coroutine = request_url("www.baidu.com")

    # create an event loop
    loop = asyncio.get_event_loop()

    # create a task (future) object without a loop reference
    task = asyncio.ensure_future(coroutine)
    print(task)

    # register the task with the loop, then start the loop
    loop.run_until_complete(task)
    print(task)


Basic usage of binding a callback.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import asyncio


# function defined with async
async def request_url(url):
    print("Requesting " + url)
    print("Successfully requested " + url)
    return url


# callback function
def callback(callback_task):
    # result() returns the return value of the coroutine wrapped in the task
    print(callback_task.result())


if __name__ == '__main__':
    # get a coroutine object
    coroutine = request_url("www.baidu.com")

    # create an event loop
    loop = asyncio.get_event_loop()

    # create a task object
    task = asyncio.ensure_future(coroutine)

    # bind the callback to the task; it is invoked once the task completes
    task.add_done_callback(callback)

    # register the task with the loop, then start the loop
    loop.run_until_complete(task)
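A callback is not the only way to retrieve the coroutine's return value: run_until_complete returns it directly, and task.result() can be read once the task has finished. A small sketch (using asyncio.new_event_loop so the example is self-contained):

```python
import asyncio


async def request_url(url):
    # placeholder body; a real crawler would fetch the page here
    return url


loop = asyncio.new_event_loop()
# schedule the coroutine on this loop as a task
task = loop.create_task(request_url("www.baidu.com"))
# run_until_complete returns the coroutine's return value directly
value = loop.run_until_complete(task)
loop.close()
```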


4. Multi-task coroutines

Points to note:

  1. If code from a synchronous (blocking) module appears inside an asynchronous coroutine, the coroutine can no longer run asynchronously.
  2. When asyncio encounters a blocking operation, it must be suspended manually with await.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import asyncio
import time


# function defined with async
async def request_url(url):
    print("Requesting " + url)
    # synchronous blocking code inside a coroutine defeats asynchrony
    # time.sleep(2)
    # when asyncio meets a blocking operation it must be suspended manually
    await asyncio.sleep(2)
    print("Successfully requested " + url)


if __name__ == '__main__':

    # start time
    start_time = time.time()

    # urls to request
    urls = [
        "https://www.baidu.com",
        "https://www.sougou.com/",
        "https://blog.csdn.net/"
    ]

    # task list
    tasks = []
    for u in urls:
        # get a coroutine object
        coroutine = request_url(u)
        # create a task object
        task = asyncio.ensure_future(coroutine)
        # append the task
        tasks.append(task)

    # create an event loop
    loop = asyncio.get_event_loop()

    # wrap the tasks with asyncio.wait, register them with the loop, then start the loop
    loop.run_until_complete(asyncio.wait(tasks))

    # total time
    print("Total time: %d s" % (time.time() - start_time))
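asyncio.gather is a common alternative to asyncio.wait: it also runs the coroutines concurrently, and additionally returns their results in input order. A sketch where a short sleep stands in for network latency:

```python
import asyncio
import time


async def request_url(url):
    await asyncio.sleep(0.2)  # simulated network latency
    return "ok " + url


async def main(urls):
    # gather runs all coroutines concurrently and keeps input order
    return await asyncio.gather(*(request_url(u) for u in urls))


urls = [
    "https://www.baidu.com",
    "https://www.sougou.com/",
    "https://blog.csdn.net/"
]
start = time.time()
results = asyncio.run(main(urls))
elapsed = time.time() - start
```

Because the three 0.2 s waits overlap, total wall time stays near 0.2 s rather than 0.6 s.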




Origin blog.csdn.net/YKenan/article/details/112002542