Python multitasking: multithreading and multiprocessing


I have been practicing Python multitasking for quite a while. When I first started writing code, I kept reading about high concurrency and asynchrony online; sometimes I used them just for the sake of using them, and sometimes I genuinely had to because of performance problems. Today I want to write down what I have learned so far.

Strictly speaking, I should first introduce the widely circulated topics of "Python is slow" and the GIL, but both have already been discussed in many articles online, so I won't repeat them here.

Python multitasking actually has three implementations: multithreading, multiprocessing, and coroutines. Coroutines are generally only used when performance requirements are particularly high, and they are more complicated to implement than multithreading and multiprocessing, so I won't cover them here; I will write a separate note on coroutines in the future.

Applicable scenarios for multithreading and multiprocessing

A one-sentence summary: multithreading is suitable for IO-intensive code, and multiprocessing is suitable for CPU-intensive code.

IO-intensive means the code involves a large amount of data exchange with disks, networks, databases, and so on. For example, a crawler involves many network requests and disk reads and writes, and remote database operations likewise involve network requests and disk IO.

CPU-intensive means the code performs a large number of computations and keeps the CPU busy, such as AI algorithms (where the computation is so heavy that the CPU isn't enough and a GPU must be used), or checking whether a large number is prime (the example used later).

Multithreading

Target function

Before actually writing any multithreaded code, let's write a function to serve as the target function for the tasks. Here I use a crawler function.

The code crawls cnblogs, which is a very good blog site. The code is only for demonstration; if readers want to run it, please don't crawl the site too frequently, so as not to place an excessive request burden on it.

The following code is written in blog_spider.py

"""
爬取cnblog首页的的信息
"""
import requests
from bs4 import BeautifulSoup


# 定义需要爬取的url
urls = [f"https://www.cnblogs.com/sitehome/p/{
      
      page}" for page in range(1, 51)]


def craw(url):
    """爬取指定url的信息"""
    content = requests.get(url).text
    
    return content


def parse(html):
    """对给定的html进行解析"""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", class_="post-item-title")
    return [(link["href"], link.get_text()) for link in links]

    

Simple implementation of multithreading

In the following code, a single-threaded and a multi-threaded crawler are timed and compared to observe the performance gap.
The following code is written in multi_thread.py

"""
对比多线程和单线程在爬虫上的效率
"""
import time
import threading
from blog_spider import urls, craw


def single_thread():
    """
    单线程爬虫
    """
    for url in urls:
        craw(url)


def multi_thread():
    """
    多线程爬虫
    """
    threads = []
    for url in urls:
        # target是目标函数,args是目标函数的参数所组成的一个元组,
        threads.append(
            threading.Thread(target=craw, args=(url,))
        )
    
    # 开始线程任务
    for thread in threads:
        thread.start()
    
    # 阻塞主线程,直到所有的线程多执行完成
    for thread in threads:
        thread.join()


if __name__ == '__main__':
    start = time.time()
    single_thread()
    end = time.time()
    print("单线程耗时:%s s" % (end - start))
    
    start = time.time()
    multi_thread()
    end = time.time()
    print("多线程耗时:%s s" % (end - start))

In the above code, the purpose of thread.join() is to block the main thread until all sub-threads have finished running, which prevents sub-threads that are still executing from being forcibly terminated because the main thread has ended.
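Strictly speaking, a normal (non-daemon) thread keeps the Python process alive on its own; it is daemon threads that get killed when the main thread exits. Here is a minimal sketch of my own (worker is a made-up function) showing the case where join() really matters:

import threading
import time


def worker():
    time.sleep(1)
    print("worker finished")


# A daemon thread is killed as soon as the main thread exits,
# so join() is what guarantees it gets to run to completion.
t = threading.Thread(target=worker, daemon=True)
t.start()
t.join()
print("main thread exits only after the worker is done")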

Resource competition and thread locks in multithreading

When using multithreading, we often run into resource competition. For example, when multiple sub-threads modify the same variable at the same time, the final result may not be what we expect unless access is controlled.

Let's use accumulating a counter a very large number of times as an example to explain this problem.

The logic is very simple: initialize number to 0, loop one million times, adding 1 to number each time, and run that same operation in two sub-threads at once. The number we expect at the end is 2000000. However, because of resource competition, it is almost impossible to get the correct answer without a thread lock (also called a mutex) to control access.
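The root cause is that number += 1 is not a single atomic operation. A quick way to see this is to disassemble a small function (a sketch I added for illustration):

import dis

number = 0


def plus():
    global number
    number += 1


# The += statement compiles to separate load, add, and store bytecode steps;
# a thread switch between the load and the store loses an update.
dis.dis(plus)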

The following sample code contains two functions, one with a mutex and one without.

The following code is written in multi_thread_lock.py

"""
对一个数字进行多次累加,可以观察到在多线程情况下,
如果不加互斥锁,可能会出现脏数据,
plus_with_lock是加了互斥锁的,
plus_without_lock是没有互斥锁的
"""
import threading
from concurrent.futures import ThreadPoolExecutor


number_with_lock = 0
number_without_lock = 0
lock = threading.Lock()


def plus_with_lock():
    global number_with_lock
    with lock:
        for _ in range(1000000):
            number_with_lock += 1
            

def plus_without_lock():
    global number_without_lock
    for _ in range(1000000):
        number_without_lock += 1


if __name__ == '__main__':
    t1 = threading.Thread(target=plus_with_lock,)
    t2 = threading.Thread(target=plus_with_lock,)
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    print(number_with_lock)

    t3 = threading.Thread(target=plus_without_lock,)
    t4 = threading.Thread(target=plus_without_lock,)
    t3.start()
    t4.start()
    t3.join()
    t4.join()
    print(number_without_lock)
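When run, the first print should always be 2000000, while the second is typically less than 2000000 because some increments get lost; how often the race actually manifests can vary across Python versions and machines.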

Thread Pool

In my personal experience, when applying multithreading, you use a thread pool in most cases rather than manually controlling each thread as in the previous two examples. Using a thread pool has two advantages:

  1. Lower overhead
    Creating a thread consumes a certain amount of resources. In the examples above, a new sub-thread is created every time one is needed; creating many sub-threads has some impact on performance, while a pool reuses its threads.
  2. Simpler code
    A thread pool is simpler to implement in code.

The following is a thread pool example with the crawler as the target function.

The following code is written in multi_thread_pool.py

from blog_spider import craw, parse, urls
from concurrent.futures import ThreadPoolExecutor
import concurrent
import time


start = time.time()
with ThreadPoolExecutor(max_workers=5) as executer:
    # the map method
    htmls = executer.map(craw, urls)
    url_html_maps = list(zip(urls, htmls))
    for url, html in url_html_maps:
        print(url)
        print(len(html))
end = time.time()
print("Multithreaded crawler took: %s " % (end - start))

with ThreadPoolExecutor(max_workers=5) as executer:
    fs = {}
    for url, html in url_html_maps:
        future = executer.submit(parse, html)
        fs[future] = url

    for future in concurrent.futures.as_completed(fs):
        # as_completed yields each future in fs as soon as it finishes,
        # instead of waiting for them in order
        # https://blog.csdn.net/panguangyuu/article/details/105335900
        url = fs[future]
        print(url, future.result())

As the code shows, I prefer to use thread/process pools with with (a context manager), because the creation and shutdown of the pool then don't need to be managed manually and the code is simpler.

ThreadPoolExecutor has two ways of running sub-threads: map and submit. map needs less code and suits cases where the tasks don't need to be operated on or managed after submission; submit returns a Future per task and suits cases where they do. Personally, I reach for map first and consider submit only when map can't meet the requirements.
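For example, here is a minimal sketch of my own (may_fail is a made-up function) of per-task error handling, which submit makes straightforward:

from concurrent.futures import ThreadPoolExecutor, as_completed


def may_fail(x):
    if x == 2:
        raise ValueError("boom")
    return x * x


with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(may_fail, i) for i in range(4)]
    for future in as_completed(futures):
        try:
            # result() re-raises any exception the task raised
            print(future.result())
        except ValueError as e:
            print("task failed:", e)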

Multi-threaded callback function

The Future objects returned by ThreadPoolExecutor also have a very useful add_done_callback method. It attaches a callback function to the task, which is triggered when the thread finishes executing; for example, it can be used to send emails, DingTalk messages, and other notifications.

Here is a simple example

from blog_spider import craw, parse, urls
from concurrent.futures import ThreadPoolExecutor


def notify(future):
    """
    Simulate a message notification function.
    add_done_callback passes the finished future as the only argument.
    """
    pass


with ThreadPoolExecutor(max_workers=5) as executer:
    # build the (url, html) pairs first
    htmls = executer.map(craw, urls)
    url_html_maps = list(zip(urls, htmls))

    for url, html in url_html_maps:
        future = executer.submit(parse, html)
        # notify is triggered as soon as this task finishes
        future.add_done_callback(notify)
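Note that the callback receives the finished Future as its only argument (hence the future parameter on notify), and that if the future has already completed by the time the callback is attached, it is invoked immediately.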

Multiprocessing

Multiprocessing and multithreading are very similar in code. I usually use a process pool with with rather than manually controlling the creation and running of each process, so here I only cover the process pool.

The following code is relatively simple and has detailed comments, so I won't explain much; just two points:

  1. The code compares the performance of single-threaded, multi-threaded, and multi-process execution in a CPU-intensive scenario.
  2. The code uses a decorator that times a function's execution. It lives in utils/function_timer.py and its code is as follows
"""
可以为函数计时的装饰器
"""
import time


def func_timer(function):
    """
    :param function: function that will be timed
    :return: duration
    """

    def function_timer(*args, **kwargs):
        t0 = time.time()
        result = function(*args, **kwargs)
        t1 = time.time()
        print(
            "[Function: {name} finished, spent time: {time:.4f}s]".format(
                name=function.__name__, time=t1 - t0
            )
        )
        return result

    return function_timer
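A quick usage sketch of the decorator (slow is a made-up function):

import time
from utils.function_timer import func_timer


@func_timer
def slow():
    time.sleep(0.5)


slow()
# prints something like: [Function: slow finished, spent time: 0.5005s]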

The following code is written in multi_process_pool.py

"""
计算一个大数是不是一个素数,
这是一个CPU消耗型的代码,更适合多进程,
这段代码会对比单线程、多线程和多进程的性能区别
"""
"""
计算一个大数是不是一个素数,
这是一个CPU消耗型的代码,更适合多进程,
这段代码会对比单线程、多线程和多进程的性能区别
"""
import math
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import ProcessPoolExecutor
from utils.function_timer import func_timer


def is_prime(n):
    """
    Check whether a number is prime.
    n should go through all the logic below so that it consumes a lot of CPU;
    if it exits partway through, the comparison of the three cases later
    may not turn out as expected.
    """
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True


@func_timer
def single_thread(numbers):
    for number in numbers:
        is_prime(number)


@func_timer
def multi_thread(numbers):
    with ThreadPoolExecutor(max_workers=10) as executer:
        executer.map(is_prime, numbers)


@func_timer
def multi_process(numbers):
    with ProcessPoolExecutor(max_workers=10) as executer:
        executer.map(is_prime, numbers)


if __name__ == '__main__':

    numbers_1 = [112272535095293] * 50  # this number makes the code consume a lot of CPU
    numbers_2 = [112272535095290] * 50  # this number is not prime, so the check exits early and consumes little CPU

    single_thread(numbers_1)
    multi_thread(numbers_1)
    multi_process(numbers_1)
    
    # the following comparison shows that multiprocessing only has an advantage for CPU-intensive work
    single_thread(numbers_2)
    multi_thread(numbers_2)
    multi_process(numbers_2)
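When run on numbers_1, multi_thread should show little or no speedup over single_thread, because the GIL lets only one thread execute Python bytecode at a time, while multi_process should be several times faster. On numbers_2 the task is so cheap that process startup and communication overhead can make multi_process the slowest of the three.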

Some useful documentation

In the process of learning Python multitasking, I found some documents that I personally think are very good. There are details this note doesn't cover: for example, thread.join looks very simple, but the knowledge behind it involves daemon ("guarding") threads. So I share these links here as well.

  1. C Programming Network's "Python Programming" tutorial
    introduces the details of Python multitasking with cases in great depth; highly recommended.
  2. The official Python documentation.
    After all, all other documentation derives from it.
  3. Liao Xuefeng's Python Tutorial - Processes and Threads: a good explanation of processes and threads, although its introduction to multiprocessing in Python is a bit dated.
  4. Liu Jiang's Python Tutorial - Multithreading and Multiprocessing: good sample code, with detailed explanations of the various common methods.

Personal blog

This article was simultaneously published on my personal site: panzhixiang.cn
