On the use of processes, threads, coroutines in Python

Description

Recently we developed an artificial intelligence scheduling platform in Python. Since the computing side uses Python + TensorFlow, the scheduling side also chose Python for language homogeneity, which raises a scheduling concurrency performance problem: the business requires sustaining 1000+ QPS of traffic, a significant challenge for a Python scheduling service.
The specific architecture is shown below:
[Architecture diagram]

Supplement:
The Python used in this architecture is CPython, not Jython or PyPy. The CPython community is the most active, and many packages are implemented against CPython, such as the TensorFlow and NumPy used in the project's computing module.
The following discussion assumes CPython.

Problem

Currently each data computing service is responsible for parsing one type of data; this has caused no problems, and the Docker-based computing containers can be scheduled onto other machines, so computation is not a performance bottleneck for now.
The scheduling business, however, must handle 1000+ RPC requests per second. Each request involves one RPC call to fetch the raw data, then an RPC call to the computing service to run the calculation and return the result, and finally an asynchronous write to storage. Different ways of driving this RPC I/O from Python yield very different performance.

The project was implemented and evolved as follows

Sequential execution

Executing the 1,000+ service calls sequentially means that if the Python program has a single process (and, without threads, a process in any language can use at most one CPU core at a time; a CPython process can use at most one core even with multiple threads, or zero cores while blocked), all business code blocks run one after another on a single core. This is clearly unacceptable: whenever an I/O call has to wait, the CPU simply idles, and only after one task finishes does the next begin. Performance is low because every piece of business is serial.
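For contrast, a minimal sketch of the serial version. Here `fetch_data` and `compute` are hypothetical stand-ins for the two RPC hops, with `time.sleep` simulating the I/O wait:

```python
import time

def fetch_data(task_id):
    # stand-in for the first RPC call (fetch raw data); the CPU just waits here
    time.sleep(0.1)
    return f'raw-{task_id}'

def compute(raw):
    # stand-in for the RPC call to the computing service
    time.sleep(0.1)
    return f'result-{raw}'

start = time.perf_counter()
results = [compute(fetch_data(i)) for i in range(5)]
elapsed = time.perf_counter() - start
print(len(results), f'{elapsed:.1f}s')  # 5 tasks x 0.2s of waiting: at least 1.0s total
```

Every task pays the full I/O wait before the next one can even start.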

Multi-threaded execution

import concurrent.futures

MAX_WORKERS = 8  # tune to the workload (the original MAX_WOKERS was a typo)

with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    for (classify, sensorLists) in classifySensors.items():
        print(f'\ncurrent classify: {classify},  current sensors list: {sensorLists}')
        try:
            executor.submit(worker, classify, sensorLists)
        except Exception as e:
            # note: this only catches submission errors; exceptions raised inside
            # worker() surface on the returned Future, not here
            print(e)

First, a brief note on how the operating system kernel schedules threads and processes. Scheduling happens at the kernel layer: depending on a thread's or process's priority, on whether its time slice is used up, or on whether it blocks, the kernel decides when to switch the CPU to another thread or process; threads and processes can also contend for the CPU via lock mechanisms. If the CPU is otherwise idle, the current thread or process can keep occupying a core. Now consider how the threads of a Python process get CPU time. Even if the machine has multiple idle cores and the scheduling service runs its business jobs through a thread pool, each thread in the CPython process executes as follows:
1. Acquire the GIL
2. Acquire a CPU time slice
3. Execute code until blocking (sleep, I/O, or some other time-consuming wait) or until another thread preempts the time slice
4. Release the GIL, switch to another thread, and repeat from step 1
So before a Python thread can run it must first acquire the GIL, and there is only one GIL per Python process.
Therefore, because the interpreted language adds this GIL, the threads of one CPython process can occupy at most one CPU core at any moment. However, while one thread waits on I/O, another thread can be switched in to run; there is no need to serialize on the previous thread's I/O before the next thread proceeds. Multiple threads thus achieve concurrency, but not parallelism: they cannot occupy multiple cores simultaneously, only interleave on one core. QPS is still better than serial execution.
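A small demonstration of that effect, with `io_task` as a hypothetical stand-in for one RPC call: even under the GIL, threads overlap their I/O waits:

```python
import concurrent.futures
import time

def io_task(task_id):
    # while this thread sleeps (simulated I/O), it releases the GIL
    # and another thread gets to run
    time.sleep(0.2)
    return task_id

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(io_task, range(5)))
elapsed = time.perf_counter() - start
print(results, f'{elapsed:.1f}s')  # roughly 0.2s total instead of 5 x 0.2s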

Coroutine execution

Processes and threads are scheduled by the kernel, with CPU time slices and preemptive scheduling (under various scheduling algorithms). Coroutines (user-level threads) are transparent to the kernel: the system does not even know they exist. They are scheduled entirely by the user's own program. Because control lives in user code, it is hard to forcibly take the CPU away the way preemptive scheduling does; scheduling is generally cooperative, and other coroutines can only run after the current coroutine yields control.
Python defines a coroutine method with async and marks the time-consuming operations inside it with the await keyword (placed by the user). When execution reaches an await, control can switch to another coroutine. So when coroutine methods are invoked many times on one CPU core of a Python process, the user-defined async + await points hand the time slice to other coroutines as needed, again achieving concurrency. The switching overhead is small and does not involve the kernel creating threads or saving and restoring their state, so well-written coroutine code in Python achieves better concurrency than multithreading. To understand the underlying implementation, see the source of Tornado's gen.coroutine decorator:
Reference:
https://note.youdao.com/ynoteshare1/index.html?id=e78ce975a872b2bdd37c7bae116790b8&type=note

Under the hood, Python's coroutines switch via iterators and the generator's next; the event loop is built on the select facilities provided by the operating system. However, Python coroutines can only switch within the current process or current thread, so they use at most one CPU core at a time: they achieve concurrency, not parallelism. Used well, their concurrency beats Python multithreading.
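A minimal asyncio sketch of the project's two I/O hops. The `fetch_data` and `compute` coroutines are hypothetical stand-ins, with `asyncio.sleep` simulating the RPC waits:

```python
import asyncio
import time

async def fetch_data(task_id):
    # stand-in for the first RPC call (fetch raw data)
    await asyncio.sleep(0.1)
    return f'raw-{task_id}'

async def compute(raw):
    # stand-in for the second RPC call (computing service)
    await asyncio.sleep(0.1)
    return f'result-{raw}'

async def handle(task_id):
    raw = await fetch_data(task_id)  # yields to other coroutines while waiting
    return await compute(raw)

async def main():
    # 100 tasks overlap their waits on a single thread and a single core
    return await asyncio.gather(*(handle(i) for i in range(100)))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(len(results), f'{elapsed:.2f}s')  # around 0.2s, not 100 x 0.2s
```

All 100 tasks share one core, yet they complete in roughly the time of one task, because every await hands control to another coroutine.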

Multi-process execution

Recall from the earlier discussion of OS scheduling that a Python thread must first acquire the GIL, and there is only one GIL per Python process. In an interpreted language like CPython, then, a multi-process design is the way out. Each process gets its own memory space and can be scheduled onto its own CPU core; since process spaces are largely independent, each process has its own GIL. Python multi-processing therefore achieves real parallelism, and its QPS is much better than coroutines or multithreading.

Multi-process + coroutine

Multi-processing can make full use of all CPU cores, while coroutines squeeze more utilization out of a single core, so the multi-process + coroutine mode combines both. Coroutines generally suit I/O-intensive work, multi-processing suits compute-intensive work, and multi-process + coroutine suits workloads that are both. In this project each task needs two I/O hops and one computation; although the computation already runs in a separate Docker container outside the scheduling process, the per-task computation time would still serialize and drag down the overall QPS, so multi-process + coroutine is the best fit here. The business of each IoT device type is placed in its own process, so the device types execute in parallel, and the many devices within one type run concurrently via coroutines. The overall QPS is good.

Go's goroutines

Essentially, a goroutine is a coroutine. The difference is that Golang handles goroutine scheduling in its runtime, in system calls, and elsewhere: when a goroutine runs for a long time or makes a system call, the runtime actively hands over the current goroutine's CPU (P) so that other goroutines can be scheduled and executed. In other words, Golang supports coroutines at the language level; prefixing a function or method call with the go keyword creates a coroutine.
Go's scheduling model is more elaborate than Python's coroutines: it allows goroutines to be scheduled across different operating system threads.
So if performance is the priority, the scheduling server should be refactored in Golang.
go reference material:
"Golang study notes two" e-book
Golang GMP scheduling model https://blog.csdn.net/qq_37858332/article/details/100689667
Golang's coroutine detailed explanation https://blog.csdn.net/weixin_30416497/article/details/96665770

Distributed transformation

The approaches above all improve the concurrency of a single scheduling-server node. If single-node tuning still cannot meet the performance requirement, a distributed transformation must be carried out. Distributing the computing nodes is relatively simple: k8s can schedule them directly across machines. The difficulty lies in the scheduling server, which acts like a client performing RPC I/O requests and dispatch: it keeps the service-node information in an in-memory data structure and connects directly to nodes to issue service requests. To go distributed, the business must be split across multiple scheduling-service clients that share the load of RPC I/O requests and dispatch. Either split along the business dimension, routing some IoT device types through one scheduling-service client and the other types through other clients; or strip the scheduling-service client down to a stateless request distributor that can be deployed in multiples, as long as a gateway defines the routing rules and forwards the RPC I/O requests to the correct server. The differences between the two:
One: multiple scheduling-service clients (business-routing logic lives in the client) -> multiple computing services.
The other: multiple stateless request distributors (no business-routing logic at all) -> business routing gateway -> multiple computing services.
In the first mode, each scheduling-service client may load a different configuration, i.e., different business-routing logic. In the second, the stateless request distributors are identical, but a routing gateway must be maintained (depending on which RPC framework is used: gRPC has official gateway options, and Nginx can serve as a matching gateway; routing configuration can be added to the gateway dynamically, with the gateway configuration rewritten and reloaded automatically).
Either path completes the distributed transformation: multiple scheduling-service clients, or multiple stateless request distributors plus a business routing gateway. Both modes require a ZooKeeper or etcd for service registration and discovery, so that the clients or distributors can fetch the corresponding computing-server resource information directly from zk or etcd. The two modes look like this:
Mode one:
[Architecture diagram]

Mode two:
[Architecture diagram]

Summary

1. In Python, a process with multiple threads can still use at most one core at a time; the same holds when using coroutines.
2. Python multi-processing can use multiple CPU cores. A Python service system should run its different services in separate processes; otherwise multi-core CPU performance goes unused and the system's own concurrency potential cannot be realized.
3. In other languages such as Java and Go, one process can use multiple cores at once when it opens multiple threads or coroutines. Go supports coroutines natively at the language level and can schedule them across different operating system threads.
4. To improve service performance, either raise single-node concurrency, running stress tests at each stage to identify the bottleneck so that improvements are targeted; or perform a distributed transformation, deploying multiple nodes to share the business load. Distribution often adds components, which can introduce single points of failure, and service consistency and data consistency issues must then be considered.

Origin blog.51cto.com/7142665/2486598