Advanced Python Concurrent Programming

Overview:

  • Multi-process development
  • Data sharing between processes
  • Process locks
  • Process pools
  • Coroutines

1. Multi-process development

A process is the smallest unit of resource allocation in a computer; a process can contain multiple threads, and threads in the same process share its resources.

Processes are isolated from each other.

Multiple processes in Python can take advantage of multiple CPU cores, so computing-intensive (CPU-bound) operations are well suited to multi-processing.
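
For instance, here is a minimal sketch (the function name sum_range and the range split are made up for illustration) that spreads a CPU-bound computation across four processes, one per core:

import multiprocessing


def sum_range(start, end):
    # CPU-bound work; each process can run on its own core
    print(sum(range(start, end)))


if __name__ == '__main__':
    processes = []
    for i in range(4):
        p = multiprocessing.Process(target=sum_range, args=(i * 10_000_000, (i + 1) * 10_000_000))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()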

1.1 Process introduction

  • The multiprocessing module

    import multiprocessing

import multiprocessing


def task():
    pass


if __name__ == '__main__':
    p1 = multiprocessing.Process(target=task)
    p1.start()
import multiprocessing

def task(arg):
    pass

def run():
    p = multiprocessing.Process(target=task, args=('xxx',))
    p.start()

if __name__ == '__main__':
    run()

Regarding how the multiprocessing module starts processes in Python:

Depending on the platform, multiprocessing supports three ways to start a process. These start methods are:

- *fork*, [copies almost all parent resources] [file objects, thread locks, etc. can be passed] [Unix only] [can be set anywhere in the code] [fast]

  The parent process uses [`os.fork()`](https://docs.python.org/3/library/os.html#os.fork) to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic. Available on Unix only. The default on Unix.

- *spawn*, [only the resources needed by run() are passed as arguments] [file objects, thread locks, etc. cannot be passed] [Unix and Windows] [the child starts from the main block] [slow]

  The parent process starts a fresh python interpreter process. The child process will only inherit those resources necessary to run the process object’s [`run()`](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process.run) method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using *fork* or *forkserver*. Available on Unix and Windows. The default on Windows and macOS.

- *forkserver*, [only the resources needed by run() are passed as arguments] [file objects, thread locks, etc. cannot be passed] [some Unix platforms] [the child starts from the main block]

  When the program starts and selects the *forkserver* start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and requests that it fork a new process. The fork server process is single threaded so it is safe for it to use [`os.fork()`](https://docs.python.org/3/library/os.html#os.fork). No unnecessary resources are inherited. Available on Unix platforms which support passing file descriptors over Unix pipes.
import multiprocessing
multiprocessing.set_start_method("spawn")  # set the start method globally; call it at most once
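
If only part of a program needs a particular start method, multiprocessing.get_context() returns a context object with the same API as the module, without changing the global default. A small sketch:

import multiprocessing


def task():
    print("child running")


if __name__ == '__main__':
    ctx = multiprocessing.get_context("spawn")  # a context with its own start method
    p = ctx.Process(target=task)
    p.start()
    p.join()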

Official documentation: https://docs.python.org/3/library/multiprocessing.html

  • Example 1
import multiprocessing
import time

"""
def task():
    print(name)        # [] - the child got a copy of `name` as it was at fork time
    name.append(123)   # modifies the child's copy only


if __name__ == '__main__':
    multiprocessing.set_start_method("fork")  # fork、spawn、forkserver
    name = []

    p1 = multiprocessing.Process(target=task)
    p1.start()

    time.sleep(2)
    print(name)  # [] - the parent's list is unchanged
"""
"""
def task():
    print(name)  # [123] - appended before the fork, so the child's copy includes it


if __name__ == '__main__':
    multiprocessing.set_start_method("fork")  # fork、spawn、forkserver
    name = []
    name.append(123)

    p1 = multiprocessing.Process(target=task)
    p1.start()
"""

"""
def task():
    print(name)  # [] - the append happened after the fork


if __name__ == '__main__':
    multiprocessing.set_start_method("fork")  # fork、spawn、forkserver
    name = []
    
    p1 = multiprocessing.Process(target=task)
    p1.start()

    name.append(123)
    
"""

  • Example 2:
import multiprocessing


def task():
    print(name)
    print(file_object)  # under fork, the open file object is inherited by the child


if __name__ == '__main__':
    multiprocessing.set_start_method("fork")  # fork、spawn、forkserver
    
    name = []
    file_object = open('x1.txt', mode='a+', encoding='utf-8')

    p1 = multiprocessing.Process(target=task)
    p1.start()

Case: buffered file writes and fork.

import multiprocessing


def task():
    print(name)
    file_object.write("达莱\n")
    file_object.flush()


if __name__ == '__main__':
    multiprocessing.set_start_method("fork")

    name = []
    file_object = open('x1.txt', mode='a+', encoding='utf-8')
    file_object.write("查苏娜\n")  # sits in the write buffer, not yet flushed to disk

    # the unflushed buffer is copied into the child at fork time,
    # so this line can end up in the file twice
    p1 = multiprocessing.Process(target=task)
    p1.start()
import multiprocessing


def task():
    print(name)
    file_object.write("达莱\n")
    file_object.flush()


if __name__ == '__main__':
    multiprocessing.set_start_method("fork")
    
    name = []
    file_object = open('x1.txt', mode='a+', encoding='utf-8')
    file_object.write("查苏娜\n")
    file_object.flush()  # flush before forking: the child inherits an empty buffer, no duplicate write

    p1 = multiprocessing.Process(target=task)
    p1.start()
import multiprocessing
import threading
import time

def func():
    print("entering func")
    with lock:
        print(666)
        time.sleep(1)

def task():
    # the copied lock is still in the "acquired" state in the child
    # acquired by whom? effectively by the child process's main thread
    for i in range(10):
        t = threading.Thread(target=func)
        t.start()
    time.sleep(2)
    lock.release()


if __name__ == '__main__':
    multiprocessing.set_start_method("fork")
    name = []
    lock = threading.RLock()
    lock.acquire()
    # print(lock)
    # lock.acquire()  # acquire the lock
    # print(lock)
    # lock.release()
    # print(lock)
    # lock.acquire()  # acquire the lock
    # print(lock)

    p1 = multiprocessing.Process(target=task)
    p1.start()

1.2 Common functions

Common methods for processes:

  • p.start(): the process becomes ready and waits to be scheduled by the CPU (the actual unit of work is a thread inside the process).

  • p.join(): wait for the process's task to finish before continuing.

import time
import multiprocessing
from multiprocessing import Process


def task(arg):
    time.sleep(2)
    print("task running...")


if __name__ == '__main__':
    multiprocessing.set_start_method("spawn")
    p = Process(target=task, args=('xxx',))
    p.start()
    p.join()  # wait here until the child finishes

    print("continuing...")

p.daemon = <Boolean>, whether the process is a daemon (must be set before start()):

  • p.daemon = True: daemon process; once the main process finishes, the child is terminated automatically.
  • p.daemon = False: non-daemon process; the main process waits for the child and only ends after the child finishes.
import time
import multiprocessing
from multiprocessing import Process


def task(arg):
    time.sleep(2)
    print("task running...")  # likely never printed: the daemon child is killed when main exits


if __name__ == '__main__':
    multiprocessing.set_start_method("spawn")
    p = Process(target=task, args=('xxx',))
    p.daemon = True
    p.start()

    print("继续执行...")
  • Set and get the name of the process
import os
import time
import threading
import multiprocessing


def func():
    time.sleep(3)


def task(arg):
    for i in range(10):
        t = threading.Thread(target=func)
        t.start()
    print(os.getpid(), os.getppid())
    print("thread count:", len(threading.enumerate()))
    time.sleep(2)
    print("current process name:", multiprocessing.current_process().name)


if __name__ == '__main__':
    print(os.getpid())
    multiprocessing.set_start_method("spawn")
    p = multiprocessing.Process(target=task, args=('xxx',))
    p.name = "hahaha"
    p.start()

    print("继续执行...")

  • Custom process class: write what the process should do directly into the run method.
import multiprocessing


class MyProcess(multiprocessing.Process):
    def run(self):
        print('running this process', self._args)


if __name__ == '__main__':
    multiprocessing.set_start_method("spawn")
    p = MyProcess(args=('xxx',))
    p.start()
    print("继续执行...")

  • Number of CPUs: how many processes should a program generally create? Usually one per CPU core, to take full advantage of multi-core hardware.
import multiprocessing
multiprocessing.cpu_count()
import multiprocessing


def task():
    pass


if __name__ == '__main__':
    count = multiprocessing.cpu_count()
    for i in range(count - 1):
        p = multiprocessing.Process(target=task)
        p.start()

2. Data sharing between processes

A process is the smallest unit of resource allocation. Each process maintains its own independent data and does not share it.

import multiprocessing


def task(data):
    data.append(666)


if __name__ == '__main__':
    data_list = []
    p = multiprocessing.Process(target=task, args=(data_list,))
    p.start()
    p.join()

    print("主进程:", data_list) # []

If you want to share data between processes, some special mechanisms are needed to achieve it.

2.1 Sharing

Shared memory

Data can be stored in a shared memory map using Value or Array. For example, the following code

    'c': ctypes.c_char,  'u': ctypes.c_wchar,
    'b': ctypes.c_byte,  'B': ctypes.c_ubyte, 
    'h': ctypes.c_short, 'H': ctypes.c_ushort,
    'i': ctypes.c_int,   'I': ctypes.c_uint,  (the uppercase codes are the unsigned variants)
    'l': ctypes.c_long,  'L': ctypes.c_ulong, 
    'f': ctypes.c_float, 'd': ctypes.c_double
from multiprocessing import Process, Value, Array


def func(n, m1, m2):
    n.value = 888
    m1.value = 'a'.encode('utf-8')  # 'c' holds a single byte
    m2.value = "达"                 # 'u' holds a single wide character


if __name__ == '__main__':
    num = Value('i', 666)
    v1 = Value('c')
    v2 = Value('u')

    p = Process(target=func, args=(num, v1, v2))
    p.start()
    p.join()

    print(num.value)  # 888
    print(v1.value)  # b'a'
    print(v2.value)  # 达

from multiprocessing import Process, Value, Array


def f(data_array):
    data_array[0] = 666


if __name__ == '__main__':
    arr = Array('i', [11, 22, 33, 44])  # array: the element type is fixed ('i' = int); only the listed type codes are allowed

    p = Process(target=f, args=(arr,))
    p.start()
    p.join()

    print(arr[:])

Server process

A manager object returned by Manager() controls a server process which holds Python objects and allows other processes to manipulate them using proxies.

from multiprocessing import Process, Manager

def f(d, l):
    d[1] = '1'
    d['2'] = 2
    d[0.25] = None
    l.append(666)

if __name__ == '__main__':
    with Manager() as manager:
        d = manager.dict()
        l = manager.list()

        p = Process(target=f, args=(d, l))
        p.start()
        p.join()

        print(d)
        print(l)
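
One caveat with manager proxies: mutating a mutable value stored inside a manager dict happens on a local copy and is never sent back; the key has to be reassigned for the change to propagate. A small sketch of the pitfall and the fix:

from multiprocessing import Process, Manager


def f(d):
    d['k'].append(666)  # mutates a local copy; the manager never sees it
    lst = d['k']
    lst.append(666)
    d['k'] = lst        # reassigning the key propagates the change


if __name__ == '__main__':
    with Manager() as manager:
        d = manager.dict()
        d['k'] = []
        p = Process(target=f, args=(d,))
        p.start()
        p.join()
        print(d['k'])  # [666] - only the reassigned value survived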

2.2 Exchange

multiprocessing supports two types of communication channel between processes

Queues

The Queue class is a near clone of queue.Queue. For example

import multiprocessing


def task(q):
    for i in range(10):
        q.put(i)


if __name__ == '__main__':
    queue = multiprocessing.Queue()
    
    p = multiprocessing.Process(target=task, args=(queue,))
    p.start()
    p.join()

    print("主进程")
    print(queue.get())
    print(queue.get())
    print(queue.get())
    print(queue.get())
    print(queue.get())
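
In practice the consumer is usually another process looping on get(); a common pattern (sketched here, with None as a hypothetical sentinel value) signals the end of the stream explicitly:

import multiprocessing


def producer(q):
    for i in range(10):
        q.put(i)
    q.put(None)  # sentinel: tells the consumer to stop


def consumer(q):
    while True:
        item = q.get()  # blocks until an item is available
        if item is None:
            break
        print("consumed:", item)


if __name__ == '__main__':
    queue = multiprocessing.Queue()
    p1 = multiprocessing.Process(target=producer, args=(queue,))
    p2 = multiprocessing.Process(target=consumer, args=(queue,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()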

Pipes

The Pipe() function returns a pair of connection objects connected by a pipe which by default is duplex (two-way). For example:

import time
import multiprocessing


def task(conn):
    time.sleep(1)
    conn.send([111, 22, 33, 44])
    data = conn.recv()  # blocks until data arrives
    print("child received:", data)
    time.sleep(2)


if __name__ == '__main__':
    parent_conn, child_conn = multiprocessing.Pipe()

    p = multiprocessing.Process(target=task, args=(child_conn,))
    p.start()

    info = parent_conn.recv()  # blocks until data arrives
    print("parent received:", info)
    parent_conn.send(666)

The above are the data sharing and exchange mechanisms between processes that Python provides. Understanding them is enough; they are rarely used directly in project development, where shared state is generally kept in third-party services such as MySQL, Redis, etc.

3. Process lock

If multiple processes compete to perform certain operations, process locks can be used to keep those operations from going wrong.

import time
from multiprocessing import Process, Value, Array


def func(n, ):
    n.value = n.value + 1  # read-modify-write on shared memory is not atomic: a race condition


if __name__ == '__main__':
    num = Value('i', 0)

    for i in range(20):
        p = Process(target=func, args=(num,))
        p.start()

    time.sleep(3)
    print(num.value)  # often less than 20: some updates are lost to the race
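
A Value created with the default lock=True exposes its lock via get_lock(); wrapping the read-modify-write in it fixes the race above. A sketch:

from multiprocessing import Process, Value


def func(n):
    with n.get_lock():  # make the read-modify-write atomic
        n.value = n.value + 1


if __name__ == '__main__':
    num = Value('i', 0)

    processes = []
    for i in range(20):
        p = Process(target=func, args=(num,))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()
    print(num.value)  # reliably 20
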
import time
from multiprocessing import Process, Manager


def f(d, ):
    d[1] += 1  # also racy: the read and the write are two separate operations on the proxy


if __name__ == '__main__':
    with Manager() as manager:
        d = manager.dict()
        d[1] = 0
        for i in range(20):
            p = Process(target=f, args=(d,))
            p.start()
        time.sleep(3)
        print(d)
import time
import multiprocessing


def task():
    # assume the file holds a single value, e.g. 10
    with open('f1.txt', mode='r', encoding='utf-8') as f:
        current_num = int(f.read())

    print("排队抢票了")
    time.sleep(1)

    current_num -= 1
    with open('f1.txt', mode='w', encoding='utf-8') as f:
        f.write(str(current_num))


if __name__ == '__main__':
    for i in range(20):
        p = multiprocessing.Process(target=task)
        p.start()

Clearly there will be problems when multiple processes run this; a lock needs to step in:

import time
import multiprocessing


def task(lock):
    print("start")
    lock.acquire()
    # assume the file holds a single value, e.g. 10
    with open('f1.txt', mode='r', encoding='utf-8') as f:
        current_num = int(f.read())

    print("queuing to grab a ticket")
    time.sleep(0.5)
    current_num -= 1

    with open('f1.txt', mode='w', encoding='utf-8') as f:
        f.write(str(current_num))
    lock.release()


if __name__ == '__main__':
    multiprocessing.set_start_method("spawn")
    
    lock = multiprocessing.RLock()  # process lock
    
    for i in range(10):
        p = multiprocessing.Process(target=task, args=(lock,))
        p.start()

    # spawn mode needs special handling: keep the main process alive
    # until the children finish (a join-based version follows below)
    time.sleep(7)

import time
import multiprocessing
import os


def task(lock):
    print("start")
    lock.acquire()
    # assume the file holds a single value, e.g. 10
    with open('f1.txt', mode='r', encoding='utf-8') as f:
        current_num = int(f.read())

    print(os.getpid(), "queuing to grab a ticket")
    time.sleep(0.5)
    current_num -= 1

    with open('f1.txt', mode='w', encoding='utf-8') as f:
        f.write(str(current_num))
    lock.release()


if __name__ == '__main__':
    multiprocessing.set_start_method("spawn")
    lock = multiprocessing.RLock()

    process_list = []
    for i in range(10):
        p = multiprocessing.Process(target=task, args=(lock,))
        p.start()
        process_list.append(p)

    # spawn mode needs special handling: join each child instead of sleeping
    for item in process_list:
        item.join()
import time
import multiprocessing

def task(lock):
    print("start")
    lock.acquire()
    # assume the file holds a single value, e.g. 10
    with open('f1.txt', mode='r', encoding='utf-8') as f:
        current_num = int(f.read())

    print("queuing to grab a ticket")
    time.sleep(1)
    current_num -= 1

    with open('f1.txt', mode='w', encoding='utf-8') as f:
        f.write(str(current_num))
    lock.release()


if __name__ == '__main__':
    multiprocessing.set_start_method('fork')
    lock = multiprocessing.RLock()
    for i in range(10):
        p = multiprocessing.Process(target=task, args=(lock,))
        p.start()

4. Process pool

import time
from concurrent.futures import ProcessPoolExecutor


def task(num):
    print("running", num)
    time.sleep(2)


if __name__ == '__main__':
    pool = ProcessPoolExecutor(4)  # a pool of 4 worker processes
    for i in range(10):
        pool.submit(task, i)
    print(1)  # submit does not block; these print immediately
    print(2)
import time
from concurrent.futures import ProcessPoolExecutor


def task(num):
    print("running", num)
    time.sleep(2)


if __name__ == '__main__':

    pool = ProcessPoolExecutor(4)
    for i in range(10):
        pool.submit(task, i)
    # wait until every task in the pool has finished, then continue
    pool.shutdown(True)
    print(1)
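
ProcessPoolExecutor can also be used as a context manager; leaving the with block implicitly calls shutdown(wait=True), so the effect is the same as above:

import time
from concurrent.futures import ProcessPoolExecutor


def task(num):
    print("running", num)
    time.sleep(2)


if __name__ == '__main__':
    # the with block waits for all submitted tasks before continuing
    with ProcessPoolExecutor(4) as pool:
        for i in range(10):
            pool.submit(task, i)
    print(1)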

import time
from concurrent.futures import ProcessPoolExecutor
import multiprocessing

def task(num):
    print("running", num)
    time.sleep(2)
    return num


def done(res):
    print(multiprocessing.current_process())
    time.sleep(1)
    print(res.result())
    time.sleep(1)


if __name__ == '__main__':

    pool = ProcessPoolExecutor(4)
    for i in range(50):
        fur = pool.submit(task, i)
        fur.add_done_callback(done)  # done() is called by the main process (unlike with a thread pool)
        
    print(multiprocessing.current_process())
    pool.shutdown(True)
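
Instead of callbacks, results can also be collected in the main process with concurrent.futures.as_completed, which yields each future as it finishes. A sketch:

import time
from concurrent.futures import ProcessPoolExecutor, as_completed


def task(num):
    time.sleep(1)
    return num * num


if __name__ == '__main__':
    with ProcessPoolExecutor(4) as pool:
        futures = [pool.submit(task, i) for i in range(10)]
        for fut in as_completed(futures):  # yields each future as it completes
            print(fut.result())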

Note: To use a process lock inside a process pool, the lock must come from a Manager (manager.Lock() / manager.RLock()); a plain multiprocessing lock cannot be passed to pool workers.

import time
import multiprocessing
from concurrent.futures.process import ProcessPoolExecutor


def task(lock):
    print("start")
    # lock.acquire()
    # lock.release()
    with lock:
        # assume the file holds a single value, e.g. 10
        with open('f1.txt', mode='r', encoding='utf-8') as f:
            current_num = int(f.read())

        print("queuing to grab a ticket")
        time.sleep(1)
        current_num -= 1

        with open('f1.txt', mode='w', encoding='utf-8') as f:
            f.write(str(current_num))


if __name__ == '__main__':
    pool = ProcessPoolExecutor()
    # lock_object = multiprocessing.RLock()  # cannot be used with a pool
    manager = multiprocessing.Manager()
    lock_object = manager.RLock()  # or manager.Lock()
    for i in range(10):
        pool.submit(task, lock_object)

Case: calculate daily user visits (total requests and distinct IPs per log file).

  • Example 1
import os
import time
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager


def task(file_name, count_dict):
    ip_set = set()
    total_count = 0
    ip_count = 0
    file_path = os.path.join("files", file_name)
    with open(file_path, mode='r', encoding='utf-8') as file_object:
        for line in file_object:
            if not line.strip():
                continue
            user_ip = line.split(" - -", maxsplit=1)[0].split(",")[0]
            total_count += 1
            if user_ip in ip_set:
                continue
            ip_count += 1
            ip_set.add(user_ip)
    count_dict[file_name] = {"total": total_count, 'ip': ip_count}
    time.sleep(1)


def run():
    # read the files in the directory and initialize the dict
    """
        1. Read every file in the directory; each process handles one file.
    """

    pool = ProcessPoolExecutor(4)
    with Manager() as manager:
        """
        count_dict = {
            "20210322.log": {"total": 10000, 'ip': 800},
        }
        """
        count_dict = manager.dict()
        
        for file_name in os.listdir("files"):
            pool.submit(task, file_name, count_dict)
            
            
        pool.shutdown(True)
        for k, v in count_dict.items():
            print(k, v)


if __name__ == '__main__':
    run()
  • Example 2
import os
import time
from concurrent.futures import ProcessPoolExecutor


def task(file_name):
    ip_set = set()
    total_count = 0
    ip_count = 0
    file_path = os.path.join("files", file_name)
    with open(file_path, mode='r', encoding='utf-8') as file_object:
        for line in file_object:
            if not line.strip():
                continue
            user_ip = line.split(" - -", maxsplit=1)[0].split(",")[0]
            total_count += 1
            if user_ip in ip_set:
                continue
            ip_count += 1
            ip_set.add(user_ip)
    time.sleep(1)
    return {"total": total_count, 'ip': ip_count}


def outer(info, file_name):
    def done(res, *args, **kwargs):
        info[file_name] = res.result()

    return done


def run():
    # read the files in the directory and initialize the dict
    """
        1. Read every file in the directory; each process handles one file.
    """
    info = {}
    
    pool = ProcessPoolExecutor(4)

    for file_name in os.listdir("files"):
        fur = pool.submit(task, file_name)
        fur.add_done_callback(outer(info, file_name))  # the callback runs in the main process

    pool.shutdown(True)
    for k, v in info.items():
        print(k, v)


if __name__ == '__main__':
    run()
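
Because task() here is a pure function of the file name, the same case can be written more compactly with pool.map, which returns results in input order. A sketch under the same files/ directory assumption:

import os
from concurrent.futures import ProcessPoolExecutor


def task(file_name):
    ip_set = set()
    total_count = 0
    file_path = os.path.join("files", file_name)
    with open(file_path, mode='r', encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue
            user_ip = line.split(" - -", maxsplit=1)[0].split(",")[0]
            total_count += 1
            ip_set.add(user_ip)
    return {"total": total_count, 'ip': len(ip_set)}


def run():
    file_names = os.listdir("files")
    with ProcessPoolExecutor(4) as pool:
        # map yields results in the same order as file_names
        for file_name, result in zip(file_names, pool.map(task, file_names)):
            print(file_name, result)


if __name__ == '__main__':
    run()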

5. Coroutines

For now, a conceptual understanding is enough.

Threads and processes are provided by the computer to implement concurrent programming (they really exist).

A coroutine is something programmers create in code (it is not a real OS construct).

A coroutine, also called a micro-thread, is a user-space context-switching technique.
In short, it makes blocks of code take turns executing on a single thread (jumping back and forth between them).

For example:

def func1():
    print(1)
    ...
    print(2)
    
def func2():
    print(3)
    ...
    print(4)
    
func1()
func2()

The above code is an ordinary function definition and execution: the code in the two functions runs sequentially, printing 1, 2, 3, 4 in order.

But if coroutine techniques are involved, the two functions can switch back and forth between each other, finally printing: 1, 3, 2, 4.

There are various ways to implement coroutines in Python, for example:

  • greenlet

    pip install greenlet
    
    from greenlet import greenlet
    
    def func1():
        print(1)        # step 2: print 1
        gr2.switch()    # step 3: switch to func2
        print(2)        # step 6: print 2
        gr2.switch()    # step 7: switch to func2, resuming where it left off
        
    def func2():
        print(3)        # step 4: print 3
        gr1.switch()    # step 5: switch to func1, resuming where it left off
        print(4)        # step 8: print 4
        
    gr1 = greenlet(func1)
    gr2 = greenlet(func2)
    
    gr1.switch() # step 1: start executing func1
    
  • yield

    def func1():
        yield 1
        yield from func2()
        yield 2
        
    def func2():
        yield 3
        yield 4
        
    f1 = func1()
    for item in f1:
        print(item)
    

Although both of the above implement coroutines, writing code this way is pointless.

Switching back and forth like this can actually make the program slower than plain serial execution.

How can coroutines be made genuinely useful?

Don't make the user switch manually; switch automatically whenever an IO operation is encountered.

Python 3.4 introduced the asyncio module, and Python 3.5 added the async and await syntax; it is built on coroutines internally and switches automatically when an IO wait is encountered.

import asyncio

async def func1():
    print(1)
    await asyncio.sleep(2)
    print(2)
    
async def func2():
    print(3)
    await asyncio.sleep(2)
    print(4)
    
tasks = [
    asyncio.ensure_future(func1()),
    asyncio.ensure_future(func2())
]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
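
On Python 3.7+, the same example is more commonly written with asyncio.run and asyncio.gather instead of driving the event loop by hand:

import asyncio


async def func1():
    print(1)
    await asyncio.sleep(2)  # on IO wait, the event loop switches to other tasks
    print(2)


async def func2():
    print(3)
    await asyncio.sleep(2)
    print(4)


async def main():
    await asyncio.gather(func1(), func2())


if __name__ == '__main__':
    asyncio.run(main())
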
"""
需要先安装:pip3 install aiohttp
"""

import aiohttp
import asyncio

async def fetch(session, url):
    print("发送请求:", url)
    async with session.get(url, verify_ssl=False) as response:
        content = await response.content.read()
        file_name = url.rsplit('_')[-1]
        with open(file_name, mode='wb') as file_object:
            file_object.write(content)
            
async def main():
    async with aiohttp.ClientSession() as session:
        url_list = [
            'https://www3.autoimg.cn/newsdfs/g26/M02/35/A9/120x90_0_autohomecar__ChsEe12AXQ6AOOH_AAFocMs8nzU621.jpg',
            'https://www2.autoimg.cn/newsdfs/g30/M01/3C/E2/120x90_0_autohomecar__ChcCSV2BBICAUntfAADjJFd6800429.jpg',
            'https://www3.autoimg.cn/newsdfs/g26/M0B/3C/65/120x90_0_autohomecar__ChcCP12BFCmAIO83AAGq7vK0sGY193.jpg'
        ]
        tasks = [asyncio.create_task(fetch(session, url)) for url in url_list]
        await asyncio.wait(tasks)


if __name__ == '__main__':
    asyncio.run(main())

The above shows that, when handling IO-bound requests, coroutines can achieve concurrent operation on a single thread.

What is the difference between coroutines, threads, and processes?

A thread is the smallest unit that the CPU can schedule.
A process is the smallest unit of resource allocation in a computer (a process provides the resources for its threads).
A process can contain multiple threads, and the threads in the same process share the process's resources.

Because of the GIL in CPython:
    - threads are suited to IO-intensive operations;
    - processes are suited to computing-intensive operations.

A coroutine (micro-thread) is a user-space context-switching technique; combined with automatic switching on IO, it enables concurrent operation on a single thread.


So when handling IO operations, coroutines are cheaper than threads (though coroutine code is somewhat harder to develop).

Many Python frameworks now support coroutines, such as FastAPI, Tornado, Sanic, Django 3, aiohttp, etc., and they are being used by more and more enterprise developers (though adoption is still modest).

Referenced articles and featured videos:

  • Articles
    https://pythonav.com/wiki/detail/6/91/
    https://zhuanlan.zhihu.com/p/137057192
  • Videos
    https://www.bilibili.com/video/BV1NA411g7yf

Summary

  1. Understand the process start methods (fork, spawn, forkserver)
  2. Master the common operations on processes and process pools
  3. Data sharing between processes
  4. Process locks
  5. The difference between coroutines, processes, and threads (conceptual)

Origin: blog.csdn.net/qq_37049812/article/details/119764770