01- Threads and processes, concurrency and parallelism (Windows system)

Main points:

A process is the basic unit of operating system resource allocation, and a thread is the basic unit of processor task scheduling and execution.

Reference article: Python3 multiprocessing (multiprocessing)

The difference between a thread and a process

This article introduces the difference between threads and processes, with Python code examples, including a script that runs two threads in one process. Threads and processes are important concepts for programmers, yet many people are unsure what each one is and how they differ. This article tries to explain both. First, the concepts:

1.1 Process

An application that runs in memory. Each process has its own independent memory space, and a process can have multiple threads. For example, in a Windows system, a running xx.exe is a process.

A process is the running instance of a program in the computer. The process was once the basic unit of execution in time-sharing systems. In process-oriented systems (such as early UNIX, and Linux 2.4 and earlier), the process is the basic execution entity of a program; in thread-oriented systems (such as most contemporary operating systems, including Linux 2.6 and later), the process is not itself the basic unit of execution but a container for threads. A program is only a description of instructions, data, and their organization; the process is the actual running instance of that program (those instructions and data).

1.2 Threads

An execution task (control unit) in a process that is responsible for the execution of programs in the current process. A process has at least one thread, a process can run multiple threads, and multiple threads can share data.

A thread is the smallest unit that the operating system can schedule. It is contained in a process and is the actual unit of execution within it. A thread is a single sequential flow of control in a process. Multiple threads can run concurrently in one process, each performing a different task.

Saying that a process is the running entity of a program means this: the program itself is stored on disk, and when it is run, one or more processes are created.

Unlike processes, the threads of one process share that process's heap and method area (in JVM terminology), but each thread has its own program counter, virtual machine stack, and native method stack. As a result, creating a thread, or switching between threads, costs the system much less than the same operation on a process, which is why threads are also called lightweight processes.
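As a rough illustration of several threads living inside one process, the sketch below starts a few threads and records the process ID and thread identifier that each one sees. The function name collect_ids and the use of a Barrier are choices made for this example only.

```python
import os
import threading


def collect_ids(n_threads=3):
    """Return a list of (process id, thread id) pairs, one per thread."""
    records = []
    lock = threading.Lock()
    # The barrier keeps all threads alive at the same time, so a finished
    # thread's identifier cannot be handed to a thread created later.
    barrier = threading.Barrier(n_threads)

    def worker():
        barrier.wait()
        with lock:
            records.append((os.getpid(), threading.get_ident()))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return records


if __name__ == "__main__":
    for pid, tid in collect_ids():
        print("pid:", pid, "thread id:", tid)
```

Every printed line should carry the same pid and a different thread id, since all the threads belong to the one process.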

1.3 Summary of the difference between process and thread

Threads have many characteristics of traditional processes, so they are also called lightweight processes or process elements; a traditional process is called a heavyweight process and is equivalent to a task with only one thread. In an operating system that supports threads, a process usually has several threads and always has at least one.

Fundamental difference: a process is the basic unit of operating system resource allocation, while a thread is the basic unit of processor task scheduling and execution.

Resource overhead: each process has its own code and data space (program context), so switching between processes is expensive. Threads can be regarded as lightweight processes: threads of the same process share code and data space, while each thread has its own stack and program counter (PC), so switching between threads is cheap.

Containment relationship: if a process contains multiple threads, execution does not follow a single line; several lines (threads) complete the work together. Threads are part of a process, which is why they are also called lightweight processes.

Memory allocation: threads of the same process share the address space and resources of that process, while the address spaces and resources of different processes are independent of each other.

Impact relationship: in protected mode, a crashed process does not affect other processes, but a crashed thread can bring down its entire process. Multiprocessing is therefore more robust than multithreading.

Execution process: each independent process has a program entry point, a sequential execution sequence, and a program exit. Threads cannot execute independently; they must live inside an application program, which controls the execution of its threads. Both processes and threads can execute concurrently.
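The memory-allocation point in the list above can be sketched as follows: a thread's change to a module-level list is visible to the rest of the program, while a child process changes only its own copy. The names shared, append_item, run_in_thread and run_in_process are hypothetical and exist only for this example.

```python
import threading
from multiprocessing import Process

shared = []  # module-level list; the name is hypothetical


def append_item(tag):
    shared.append(tag)


def run_in_thread():
    # The thread runs inside this process and appends to the same list
    t = threading.Thread(target=append_item, args=("from thread",))
    t.start()
    t.join()
    return list(shared)


def run_in_process():
    # The child process appends to its own copy of the list
    p = Process(target=append_item, args=("from process",))
    p.start()
    p.join()
    return list(shared)


if __name__ == "__main__":
    print(run_in_thread())   # includes "from thread"
    print(run_in_process())  # "from process" is missing from the parent's list
```

To get data back from a child process, it has to be sent explicitly, for example through a Queue or a Pipe.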


2. Python multithreading

2.1 Multithreading

In many other languages, a multi-core CPU can execute several threads of one process at the same time. In Python (specifically CPython, the standard interpreter), only one thread can execute Python bytecode at any moment, whether the CPU is single-core or multi-core. The reason is the GIL, the Global Interpreter Lock, a decision made early in Python's design for the sake of data safety. A thread must acquire the GIL before it can execute. Think of the GIL as a "pass": each Python process has only one, and a thread without the pass is not allowed onto the CPU.

Because of the GIL, a Python process runs only one thread at a time (the one holding the GIL), which is the fundamental reason Python multithreading gains little on multi-core CPUs for CPU-bound work.
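One way to observe the effect of the GIL is to time the same CPU-bound loop run twice in a row and then in two threads. This is only a sketch: the function names and the loop size are assumptions of this example, and the timings depend on the machine and the Python build.

```python
import threading
import time


def countdown(n):
    # Pure-Python loop: CPU-bound, never waits on I/O
    while n > 0:
        n -= 1
    return n


def time_sequential(n):
    start = time.perf_counter()
    countdown(n)
    countdown(n)
    return time.perf_counter() - start


def time_threaded(n):
    threads = [threading.Thread(target=countdown, args=(n,)) for _ in range(2)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 5_000_000
    print("sequential:  %.2fs" % time_sequential(n))
    print("two threads: %.2fs" % time_threaded(n))
```

On a standard CPython build the two timings are usually close to each other, because the two threads take turns holding the GIL instead of running on separate cores.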

2.2 Create multi-thread

Python provides two modules for multithreading: _thread (named thread in Python 2) and threading. The former is a low-level module for lower-level operations and is rarely used in application-level development; threading is the one normally used.

Method 1: Use threading.Thread() directly

import threading
 
# The function name is arbitrary
def run(n):
    print("current task:", n)
 
if __name__ == "__main__":
    t1 = threading.Thread(target=run, args=("thread 1",))
    t2 = threading.Thread(target=run, args=("thread 2",))
    t1.start()
    t2.start()

Method 2: Define a custom thread class that inherits from threading.Thread and override its run method

import threading
 
class MyThread(threading.Thread):
    def __init__(self, n):
        super(MyThread, self).__init__()  # required when overriding __init__
        self.n = n

    def run(self):
        print("current task:", self.n)
 
if __name__ == "__main__":
    t1 = MyThread("thread 1")
    t2 = MyThread("thread 2")
 
    t1.start()
    t2.start()

2.3 Thread Merging

join() blocks the calling thread until the thread it is called on has finished. Without it, the main thread can reach the end of its code while child threads are still running; calling join() on each child makes the main thread wait for them before it continues or exits.

import threading
 
def count(n):
    while n > 0:
        n -= 1
 
if __name__ == "__main__":
    t1 = threading.Thread(target=count, args=(100000,))
    t2 = threading.Thread(target=count, args=(100000,))
    t1.start()
    t2.start()
    # Block the main thread until t1 and t2 have finished
    t1.join()
    t2.join()
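A common reason to call join() is that the main thread needs the workers' results. The sketch below, with a hypothetical square_all helper, reads the result list only after every thread has been joined.

```python
import threading


def square_all(numbers):
    results = []

    def work(x):
        results.append(x * x)

    threads = [threading.Thread(target=work, args=(x,)) for x in numbers]
    for t in threads:
        t.start()
    # Once every join() has returned, all workers have finished
    for t in threads:
        t.join()
    return sorted(results)  # sorted because completion order is not guaranteed


if __name__ == "__main__":
    print(square_all([1, 2, 3, 4]))
```

Reading results before the joins could return a partial list, since some threads might not have run yet.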

2.4 Thread synchronization and mutex

Threads share data. When multiple threads operate on the same shared data, thread safety must be considered. The threading module defines a Lock class, which provides a mutex to keep data correct under multithreading.

Basic steps of usage: acquire(), release()

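import threading

# Create the lock
mutex = threading.Lock()
# Acquire it; timeout (in seconds) is optional and is passed by keyword
acquired = mutex.acquire(timeout=1)
# Release it
mutex.release()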

The acquire method takes an optional timeout parameter, which in Python 3 should be passed by keyword because the first positional parameter is blocking. If timeout is set, acquire returns after at most that many seconds, and its return value tells you whether the lock was obtained, so that other handling can be performed if it was not. See the sample code for specific usage:

import threading
import time
 
num = 0
mutex = threading.Lock()
 
class MyThread(threading.Thread):
    def run(self):
        global num 
        time.sleep(1)
 
        if mutex.acquire(timeout=1):  # True if the lock was obtained within 1 second
            num = num + 1
            msg = self.name + ': num value is ' + str(num)
            print(msg)
            mutex.release()
 
if __name__ == '__main__':
    for i in range(5):
        t = MyThread()
        t.start()
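A Lock can also be used as a context manager, which releases it even if the protected code raises an exception. The following is a minimal sketch of that form; the locked_count function and its parameters are hypothetical.

```python
import threading


def locked_count(n_threads=4, increments=10000):
    counter = 0
    lock = threading.Lock()

    def add():
        nonlocal counter
        for _ in range(increments):
            with lock:  # acquire() on entry, release() on exit, even on error
                counter += 1

    threads = [threading.Thread(target=add) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(locked_count())
```

Because every increment happens while the lock is held, the final count equals n_threads * increments.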

2.5 Timers

To run a function after a specified number of seconds, use the Timer class. The specific usage is as follows:

from threading import Timer
 
def show():
    print("Python")

# Run show() after one second
t = Timer(1, show)
t.start()
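A Timer that has not fired yet can be stopped with cancel(). The sketch below starts two timers and cancels one of them; the timer_demo function and the use of Event objects to record what ran are assumptions of this example.

```python
import threading


def timer_demo(delay=0.2):
    fired = threading.Event()
    cancelled_ran = threading.Event()

    t1 = threading.Timer(delay, fired.set)
    t2 = threading.Timer(delay, cancelled_ran.set)
    t1.start()
    t2.start()
    t2.cancel()  # stop t2 before its delay has elapsed

    # Timer is a Thread subclass, so join() waits for it to end
    t1.join()
    t2.join()
    return fired.is_set(), cancelled_ran.is_set()


if __name__ == "__main__":
    print(timer_demo())
```

The first value reports whether the normal timer ran its function, and the second whether the cancelled one did.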

3. Python multi-process

3.1 Create multiple processes

Multi-process operation in Python uses the multiprocessing library, whose Process class is very similar to the Thread class of the threading module. Reading the code is therefore the quickest way to get familiar with multi-processing.

Method 1: Use Process directly. The code is as follows:

from multiprocessing import Process  
 
def show(name):
    print("Process name is " + name)
 
if __name__ == "__main__": 
    proc = Process(target=show, args=('subprocess',))  
    proc.start()  
    proc.join()

Method 2: Define a custom process class that inherits from Process and override its run method. The code is as follows:

from multiprocessing import Process
import time
 
class MyProcess(Process):
    def __init__(self, name):
        super(MyProcess, self).__init__()
        self.name = name  # Process.name must be a string

    def run(self):
        print('process name :' + str(self.name))
        time.sleep(1)

if __name__ == '__main__':
    procs = [MyProcess('process-%d' % i) for i in range(3)]
    for p in procs:
        p.start()
    # Wait for every child, not only the last one created
    for p in procs:
        p.join()

3.2 Multi-process communication

Processes do not share data. If processes need to communicate, use Queue or Pipe.

Queue

Queue is a multi-process-safe queue that can transfer data between processes. Its two main methods are put and get.

put() inserts data into the queue. It has two optional parameters, block and timeout. If block is True (the default) and timeout is a positive value, the method blocks for at most timeout seconds until the queue has free space, and raises queue.Full if the time runs out. If block is False and the queue is full, queue.Full is raised immediately.

get() reads and removes one element from the queue. It has the same two optional parameters, block and timeout. If block is True (the default) and timeout is a positive value, the method waits for at most timeout seconds and raises queue.Empty if no element arrives in that time. If block is False, there are two cases: if the queue has a value available, it is returned immediately; if the queue is empty, queue.Empty is raised immediately.
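The blocking and timeout behavior described above can be tried in a single process. In this sketch the helpers try_put and try_get are hypothetical wrappers written for the example; the exceptions come from the standard queue module.

```python
import queue
from multiprocessing import Queue


def try_put(q, item):
    """Put without blocking; report whether there was room."""
    try:
        q.put(item, block=False)
        return True
    except queue.Full:
        return False


def try_get(q, timeout):
    """Wait up to `timeout` seconds for an item; return None if none arrives."""
    try:
        return q.get(block=True, timeout=timeout)
    except queue.Empty:
        return None


if __name__ == "__main__":
    q = Queue(maxsize=1)
    print(try_put(q, "a"))   # the queue has room for one item
    print(try_put(q, "b"))   # the queue is full, so queue.Full is caught
    print(try_get(q, 1.0))   # the stored item
    print(try_get(q, 0.1))   # nothing left, so queue.Empty is caught
```

Returning None for an empty queue is a simplification for the example; real code that stores None in the queue would need a different signal.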

The specific usage is as follows:

from multiprocessing import Process, Queue
 
def put(queue):
    queue.put('Queue usage')
 
if __name__ == '__main__':
    queue = Queue()
    pro = Process(target=put, args=(queue,))
    pro.start()
    print(queue.get())   
    pro.join()

Pipe

A Pipe transfers data between processes through a pipe rather than sharing data, somewhat like a socket. Pipe() returns two connection objects representing the two ends of the pipe, each with send() and recv() methods. If two processes try to read from or write to the same end at the same time, the data in the pipe may become corrupted. Usage is as follows:

from multiprocessing import Process, Pipe
 
def show(conn):
    conn.send('Pipe usage')
    conn.close()
 
if __name__ == '__main__':
    parent_conn, child_conn = Pipe() 
    pro = Process(target=show, args=(child_conn,))
    pro.start()
    print(parent_conn.recv())   
    pro.join()
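Since a Pipe is two-way by default, a child can also reply on the same connection. The sketch below shows a simple request/reply loop; the echo_upper function and the use of None as a stop signal are assumptions of this example.

```python
from multiprocessing import Pipe, Process


def echo_upper(conn):
    """Upper-case every string received until None arrives."""
    while True:
        msg = conn.recv()
        if msg is None:  # None is the stop signal chosen for this sketch
            break
        conn.send(msg.upper())
    conn.close()


if __name__ == "__main__":
    # Duplex by default: both ends can send() and recv()
    parent_conn, child_conn = Pipe()
    worker = Process(target=echo_upper, args=(child_conn,))
    worker.start()
    for word in ["hello", "pipe"]:
        parent_conn.send(word)
        print(parent_conn.recv())
    parent_conn.send(None)
    worker.join()
```

Each end is used by exactly one process here, which avoids the corruption risk of two processes sharing an end.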

3.3 Process pool

To create many processes, there is no need to create them one at a time; the Pool class can manage them. Commonly used Pool methods include apply(), apply_async(), map(), close(), terminate() and join().

See the sample code for specific usage:

#coding: utf-8
import multiprocessing
import time
 
def func(msg):
    print("msg:", msg)
    time.sleep(3)
    print("end")
 
if __name__ == "__main__":
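    # Keep `processes` worker processes in the pool; a worker that finishes a task picks up the next one
    pool = multiprocessing.Pool(processes = 3)
    for i in range(5):
        msg = "hello %d" %(i)
        # Non-blocking: the task does not hold up the main process, which runs straight on to pool.join()
        pool.apply_async(func, (msg, ))

        # Blocking: each task finishes before the main process continues
        # pool.apply(func, (msg, ))

    print("Mark~ Mark~ Mark~~~~~~~~~~~~~~~~~~~~~~")
    # Call close() before join(); otherwise an error is raised
    pool.close()
    # After close() no new tasks are accepted; join() waits for all worker processes to finish
    pool.join()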
    print("Sub-process(es) done.")

As shown above, once the Pool has been created, apply_async keeps submitting tasks without stopping to wait, even when the number of tasks far exceeds the pool's process limit. Submitting 10 tasks, for example, simply places 10 requests in a queue.

After a statement such as p1 = Pool(5) executes, 5 processes have been created but not yet assigned tasks. No matter how many tasks there are, only 5 worker processes exist, so at most 5 tasks run in parallel at any time.

When a worker finishes a task, it becomes idle, and the pool hands it the next queued request on a first-in-first-out basis.

When all tasks in the Pool are complete, the 5 worker processes are left behind and may linger as zombie processes. As long as the main process keeps running, the system does not reclaim them automatically; call close() and then join() to reclaim them.

join makes the main process wait for the worker processes to finish so that their resources can be reclaimed. Without join, the workers are forcibly killed when the main program exits, whether or not their tasks have finished.

If the maximum number of processes is not specified when creating a Pool, the default is the number of CPUs reported by the system.
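When the results of the tasks are needed, map() is often more convenient than apply_async(), because it returns the results in input order. The following is a minimal sketch, assuming a hypothetical square task and run_pool helper.

```python
from multiprocessing import Pool


def square(x):
    return x * x


def run_pool(values, workers=3):
    # map() blocks until every result is ready and returns them in input
    # order; leaving the with-block then shuts the pool down.
    with Pool(processes=workers) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    print(run_pool(range(5)))
```

The task function is defined at module level so that worker processes can import it, and the pool is created under the `if __name__ == "__main__"` guard, which matters on Windows, where child processes re-import the main module.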

3.4 Choose multi-thread or multi-process

On this issue, first look at what type of program your program belongs to. There are generally two types: CPU-intensive and I/O-intensive.

CPU-intensive: The program is more focused on calculations and requires frequent use of the CPU for calculations. For example, scientific computing programs, machine learning programs, etc.
I/O-intensive: As the name implies, the program requires frequent input and output operations. A crawler program is a typical I/O-intensive program.

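If the program is CPU-intensive, multiprocessing is recommended, because separate processes do not share a GIL and can use multiple cores. Multithreading is better suited to I/O-intensive programs, where threads spend most of their time waiting.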
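The I/O-intensive case can be imitated with time.sleep, which, like blocking I/O, releases the GIL while it waits. In this sketch, fake_io and run_threaded are hypothetical names, and the sleep stands in for a network or disk request.

```python
import threading
import time


def fake_io(seconds):
    # Stand-in for a blocking request; the thread waits without using the CPU
    time.sleep(seconds)


def run_threaded(tasks, seconds):
    """Run `tasks` simulated requests in parallel threads; return elapsed time."""
    threads = [threading.Thread(target=fake_io, args=(seconds,))
               for _ in range(tasks)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = run_threaded(tasks=5, seconds=0.2)
    print("5 waits of 0.2s took %.2fs in threads" % elapsed)
```

Run one after another, the five waits would take about one second; in threads they overlap, so the total stays close to a single wait.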

4. Concurrency and parallelism

Concurrency is the ability of a system to handle multiple tasks over the same period by switching between them so that each task gets a chance to run. It is usually achieved through operating system processes or threads. In a concurrent system, each task usually has its own flow of execution, and the tasks share system resources (such as memory and CPU time slices), so when several tasks are in progress at once, their access to those resources must be coordinated to avoid problems such as race conditions and deadlocks.

In contrast, parallelism means that the system executes multiple tasks at literally the same instant, with different tasks running on different processors. The main purpose of a parallel system is performance: multiple processors let multiple tasks proceed simultaneously, reducing overall processing time. When parallel tasks are fully independent, each with its own flow of execution and resources, there is little need for resource sharing and coordination; tasks that do share data still need the same coordination as in the concurrent case.

It should be noted that concurrency and parallelism are not completely independent concepts. In modern computer systems, concurrency and parallelism often exist simultaneously, for example, running multiple processes or threads on a multi-core CPU.