Multithreading and multiprocessing in Python and their differences

Table of contents

Introduction

1 Multi-process

    1.1 The fork method
    1.2 The multiprocessing module
    1.3 The Pool class
    1.4 Inter-process communication
        1 Queue
        2 Pipe

2 Multithreading

    2.1 threading
    2.2 Thread synchronization
    2.3 Deadlock and recursive lock
        1 Deadlock
        2 Recursive lock


Introduction

For a newcomer, the first step is to understand what a thread is and why you would use multithreading at all. The usual definition found online goes like this: a thread is the smallest unit of execution that the operating system can schedule; it is contained within a process and is the actual working unit inside that process. If that leaves you none the wiser, you are not alone: definitions like this rarely help a beginner. So let's explain it in plain language:

  • Suppose you run a property management company. At first, business is slim and you do everything yourself: after repairing the heating pipes at Old Wang's house next door, you immediately head to Old Li's house to change a light bulb. This is single-threading: all the work is executed sequentially.
  • Later, as business grows, you hire a few workers, so the company can serve several customers at the same time. This is multithreading, and you are the main thread.
  • The tools the workers use are provided by the company and shared by everyone. This is resource sharing between threads.
  • The workers need a pipe wrench for a job, but there is only one pipe wrench. This is a conflict.
  • There are many ways to resolve the conflict, such as queuing up or having a colleague phone you when the wrench is free. This is thread synchronization.
  • When business is slow, you drink tea in the office. When quitting time arrives, you post a message in the WeChat group and every worker immediately drops his tools and leaves, whether or not the job at hand is finished. So, if necessary, you have to avoid sending the quitting-time notice while the workers are busy. This is setting and managing the daemon attribute of threads.
  • Later, the company grows and serves many residential communities at once. You set up a branch office in each community, each run by a branch manager and operating almost exactly like the head office. This is multiprocessing: the head office is the main process and each branch is a child process.
  • The tools of the head office and of each branch are independent and cannot be borrowed or mixed: resources are not shared between processes. Branches can talk over dedicated telephone lines, which is the pipe, or exchange information on the company bulletin board, which is shared memory.
  • A branch can close when the head office closes, or it can stay open until it has finished all of the day's work. This is the daemon setting for processes.

1 Multi-process

There are two main ways to implement multiprocessing in Python: the fork() function in the os module and the multiprocessing module. The former only works on Unix-like systems such as Linux and macOS, while the latter is a cross-platform implementation.

1.1 The fork method

import os

# Note: os.fork() only works on Unix/Linux/Mac; it is not available on Windows
pid = os.fork()

if pid == 0:
    # fork() returns 0 in the child process
    print('haha 1')
else:
    # fork() returns the child's pid in the parent process
    print('haha 2')

Note: the fork() function can only run on Unix/Linux/Mac, not on Windows.

Explanation:

  • When the program reaches os.fork(), the operating system creates a new process (the child process) and copies all of the parent process's information into it;
  • Both the parent and the child then receive a return value from fork(). In the child this value is always 0; in the parent it is the child's process ID.

The Unix/Linux operating system provides a special system call, fork(). An ordinary function is called once and returns once, but fork() is called once and returns twice: the operating system automatically makes a copy of the current process (the parent process), producing a new process (the child process), and then returns in both the parent and the child. The child process always gets 0 as the return value, while the parent process gets the child's process ID.

The reason for this is that a parent process can fork many child processes, so the parent has to keep track of each child's ID, whereas a child only needs to call getppid() to find its parent's ID. In Python, os.getpid() returns the current process ID and os.getppid() returns the parent's process ID.
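
As a minimal sketch (Unix/Linux/Mac only, with illustrative print messages), the two return values of fork() can be combined with os.getpid() and os.getppid() like this:

import os

pid = os.fork()  # Unix/Linux/Mac only

if pid == 0:
    # child: fork() returned 0, so look up our own pid and our parent's pid
    print("child:  pid = %d, parent pid = %d" % (os.getpid(), os.getppid()))
else:
    # parent: fork() returned the child's pid directly
    print("parent: pid = %d, child pid = %d" % (os.getpid(), pid))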

So, is there a guaranteed execution order between the parent and the child process? No: it depends entirely on the operating system's scheduling algorithm.

Calling fork() multiple times produces a tree of processes: each call doubles the number of processes, because both the parent and the child continue past the fork() and each of them can fork again.
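
A minimal sketch of that doubling effect (again Unix-only; the print format is just illustrative): after two consecutive fork() calls there are four processes, and each of them executes the final print once.

import os

# After the first fork() there are 2 processes; after the second there are 4.
os.fork()
os.fork()

# Printed four times, once by each process in the tree.
print("pid = %d, parent pid = %d" % (os.getpid(), os.getppid()))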

1.2 The multiprocessing module

The multiprocessing module provides a Process class to represent a process object. To create a child process you only need to pass in a target function and its arguments; the start() method launches the process, and the join() method waits for it to finish, which is how processes are synchronized:

import os
from multiprocessing import Process

def run_proc(name):
    print("Child process (%s) (%s) running..." % (name, os.getpid()))


if __name__ == '__main__':
    print("Current process (%s) start..." % (os.getpid()))
    for i in range(5):
        p = Process(target=run_proc, args=(str(i), ))  # args is a tuple of arguments for the target
        print("Process will start.")
        p.start()
    p.join()  # note: this only waits for the last process created in the loop
    print("Process end.")

The output is:

Current process (26811) start...
Process will start.
Process will start.
Process will start.
Child process (0) (26872) running...
Process will start.
Child process (1) (26874) running...
Process will start.
Child process (2) (26876) running...
Child process (3) (26882) running...
Child process (4) (26885) running...
Process end.

1.3 The Pool class

The drawback of creating Process objects by hand is that starting a large number of child processes becomes unwieldy, so it only suits cases where the number of tasks is small; a process Pool solves this. In short, a Pool lets you specify the number of worker processes (the default is the number of CPU cores), and at most that many processes run at the same time:

import os, time, random
from multiprocessing import Pool

def run_task(name):
    print("Task %s (pid = %s) is running..." % (name, os.getpid()))
    time.sleep(random.random() * 3)
    print("Task %s end." % name)

if __name__ == '__main__':
    print("Current process (%s) start..." % (os.getpid()))
    p = Pool(processes=3)
    for i in range(5):
        p.apply_async(run_task, args=(i, ))
    print("Waiting for all subprocess done...")
    p.close()
    p.join()
    print("All subprocess done.")

The output is:

Current process (4202) start...
Waiting for all subprocess done...
Task 0 (pid = 4255) is running...
Task 1 (pid = 4256) is running...
Task 2 (pid = 4257) is running...
Task 0 end.
Task 3 (pid = 4255) is running...
Task 2 end.
Task 4 (pid = 4257) is running...
Task 4 end.
Task 1 end.
Task 3 end.
All subprocess done.

Note that calling join() on a Pool object waits for all of its worker processes to finish, and close() must be called before join(); once close() has been called, no new tasks can be submitted to the pool.
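
If you also need the tasks' return values, apply_async() returns an AsyncResult object whose get() method blocks until the result is ready. A minimal sketch of that pattern (the square function and the worker count are illustrative assumptions, not part of the original example):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    p = Pool(processes=3)
    results = [p.apply_async(square, (i,)) for i in range(5)]
    p.close()                          # no new tasks may be submitted after close()
    p.join()                           # wait for all workers; requires close() first
    print([r.get() for r in results])  # expected output: [0, 1, 4, 9, 16]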

1.4 Inter-process communication

Python provides a variety of inter-process communication methods. This article mainly talks about Queue and Pipe.

1 Queue

A multiprocessing.Queue is mainly used for passing messages between processes. Its main operations are listed below (a short sketch of the blocking behaviour follows the list):

  • When instantiating the Queue class you can pass the maximum number of messages, e.g. q = Queue(5), which means the queue holds at most 5 messages. If the maximum is omitted or negative, the queue size is unlimited (until memory runs out);
  • Queue.qsize(): returns the number of messages currently in the queue;
  • Queue.empty(): returns True if the queue is empty, otherwise False;
  • Queue.full(): returns True if the queue is full, otherwise False;
  • Queue.get([block[, timeout]]): removes and returns one message from the queue; block defaults to True;
    • If block is left at its default and no timeout (in seconds) is set, the call blocks until a message can be read from the queue. If timeout is set, it waits up to timeout seconds and then raises a queue.Empty exception if nothing was read;
    • If block is False and the queue is empty, a queue.Empty exception is raised immediately;
  • Queue.get_nowait(): equivalent to Queue.get(False);
  • Queue.put(item[, block[, timeout]]): writes the item message to the queue; block defaults to True;
    • If block is left at its default and no timeout (in seconds) is set, the call blocks until there is room in the queue. If timeout is set, it waits up to timeout seconds and then raises a queue.Full exception if there is still no room;
    • If block is False and there is no room in the queue, a queue.Full exception is raised immediately;
  • Queue.put_nowait(item): equivalent to Queue.put(item, False).
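
Here is a minimal sketch of the blocking and non-blocking behaviour described above, run in a single process just to show the exceptions (the 2-slot capacity and the 0.1-second timeout are arbitrary choices):

import queue                      # the Empty and Full exceptions live in the standard queue module
from multiprocessing import Queue

if __name__ == '__main__':
    q = Queue(2)                  # at most 2 messages
    q.put('a')
    q.put('b')

    try:
        q.put('c', block=False)   # the queue is full, so this raises immediately
    except queue.Full:
        print("queue is full")

    print(q.get())                # 'a'
    print(q.get())                # 'b'

    try:
        q.get(timeout=0.1)        # nothing left; raises after about 0.1 s
    except queue.Empty:
        print("queue is empty")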

To better illustrate communication through a Queue, here is an example: the parent process creates three child processes, two of which write data to the Queue while the third reads data from it.

import os, time
from multiprocessing import Process, Queue

"""Write to process."""
def proc_write(q, urls):
    print("Process (%d) is writing..." % os.getpid())
    for url in urls:
        q.put(url)
        print('Put %s to queue...' % url)
        time.sleep(0.1)

"""Read form process."""    
def proc_read(q):
    print("Process (%d) is reading..." % (os.getpid()))
    while True:
        url = q.get(True)
        print("Get %s from queue." % url)

if __name__ == '__main__':
    # Create the queue in the parent process and share it with the three children
    q = Queue()
    writer1 = Process(target=proc_write, args=(q, ['张飞', '黄忠', "孙尚香"]))
    writer2 = Process(target=proc_write, args=(q, ['马超', '关羽', "赵云"]))
    reader = Process(target=proc_read, args=(q, ))
    # Start the child processes
    writer1.start()
    writer2.start()
    reader.start()
    writer1.join()
    writer2.join()
    # The read loop runs forever, so the reader must be terminated explicitly
    reader.terminate()

The output is:

Process (19307) is writing...
Put 张飞 to queue...
Process (19308) is writing...
Put 马超 to queue...
Process (19309) is reading...
Get 张飞 from queue.
Get 马超 from queue.
Put 黄忠 to queue...
Get 黄忠 from queue.
Put 关羽 to queue...
Get 关羽 from queue.
Put 孙尚香 to queue...
Get 孙尚香 from queue.
Put 赵云 to queue...
Get 赵云 from queue.

2 Pipe

Pipe() returns two connection objects representing the two ends of the pipe, and each connection object has a send() method and a recv() method.

Note, however, that the data in a pipe may become corrupted if two processes (or threads) try to read from or write to the same end of the pipe at the same time. There is no risk of corruption when the processes use different ends of the pipe at the same time.

import os, time
from multiprocessing import Process, Pipe

def proc_send(p, urls):
    print("Process (%d) is sending..." % os.getpid())
    for url in urls:
        p.send(url)
        print('Send %s...' % url)
        time.sleep(0.1)

def proc_recv(p):
    print("Process (%d) is receiving..." % (os.getpid()))
    while True:
        print("Receive %s" % p.recv())
        time.sleep(0.1)

if __name__ == '__main__':
    # Create the pipe; p[0] and p[1] are its two ends
    p = Pipe()
    p1 = Process(target=proc_send, args=(p[0], ['张飞' + str(i) for i in range(3)]))
    p2 = Process(target=proc_recv, args=(p[1], ))
    p1.start()
    p2.start()
    p1.join()
    # The receive loop runs forever, so terminate the receiver once the sender is done
    p2.terminate()

The output is:

Process (39203) is sending...
Send 张飞0...
Process (39204) is receiving...
Receive 张飞0
Send 张飞1...
Receive 张飞1
Send 张飞2...
Receive 张飞2
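
Because Pipe() is duplex (two-way) by default, one common way to sidestep the same-end corruption risk is a one-way pipe created with Pipe(duplex=False): the first connection returned can only receive and the second can only send. A minimal sketch (the 'hello' payload is arbitrary):

from multiprocessing import Process, Pipe

def sender(conn):
    conn.send('hello')   # the send-only end
    conn.close()

if __name__ == '__main__':
    recv_conn, send_conn = Pipe(duplex=False)   # one-way: recv_conn reads, send_conn writes
    p = Process(target=sender, args=(send_conn, ))
    p.start()
    print(recv_conn.recv())                     # prints 'hello'
    p.join()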

2 Multithreading

Multithreading is similar to running several different programs at the same time, and it has the following advantages:

  • Tasks can be processed in the background;
  • The user interface can stay responsive and more attractive, for example by showing a progress bar;
  • It can potentially speed up the program;
  • For tasks that involve waiting, such as user input, file reading and writing, or sending and receiving data over the network, other threads can keep working (and resources such as memory can be used more efficiently) while one thread waits.

Python's standard library provides two modules: _thread (called thread in Python 2) and threading. _thread is a low-level module, and threading is a higher-level module that wraps _thread. In most cases we only need the higher-level threading module. Commonly used methods and attributes of threading.Thread, plus two related classes from the threading module:

  • start(): makes the thread ready to run and waits for CPU scheduling
  • setName(): sets the thread's name (newer Python versions prefer the name attribute)
  • getName(): gets the thread's name (newer Python versions prefer the name attribute)
  • setDaemon(): marks the thread as a daemon (background) thread or a foreground thread (the default); a daemon thread runs alongside the main thread but is stopped as soon as the main thread finishes, whether or not its work is done, while a foreground thread keeps the program alive until it has finished (newer Python versions prefer the daemon attribute; a short daemon sketch follows this list)
  • join(): waits for the thread to finish before continuing; joining each thread immediately after starting it serializes the threads and defeats the purpose of multithreading
  • run(): executed automatically once the thread has been scheduled by the CPU
  • Lock: a thread lock (mutex) provided by the threading module
  • Event: a simple signalling object, also provided by the threading module
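
As a minimal sketch of the daemon attribute mentioned in the list above (the 0.5-second sleep stands in for real work): a daemon thread is stopped as soon as the main thread exits, so the worker's final print normally never appears; with daemon left at its default of False, the program would instead wait for the worker to finish.

import time, threading

def worker():
    print("%s started" % threading.current_thread().name)
    time.sleep(0.5)     # stand-in for real work
    print("%s finished" % threading.current_thread().name)   # normally never printed for a daemon thread

if __name__ == '__main__':
    t = threading.Thread(target=worker, name='worker', daemon=True)  # same effect as t.setDaemon(True)
    t.start()
    print("MainThread exiting; the daemon worker is stopped with it.")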

2.1 threading

Method 1: pass a function to the Thread constructor to create a Thread instance, then call start() to run it:

import time, threading

def thread_run(urls):
    print("Current %s is running..." % threading.current_thread().name)
    for url in urls:
        print("%s --->>> %s" % (threading.current_thread().name, url))
        time.sleep(0.1)
    print("%s ended." % threading.current_thread().name)

if __name__ == '__main__':
    print("%s is running..." % threading.current_thread().name)
    t1 = threading.Thread(target=thread_run, name='t1', args=(['唐僧', '孙悟空', '猪八戒'],))
    t2 = threading.Thread(target=thread_run, name='t2', args=(['张飞', '关于', '刘备'],))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    print("%s ended." % threading.current_thread().name)

The output is: 

MainThread is running...
Current t1 is running...Current t2 is running...

t2 --->>> 张飞
t1 --->>> 唐僧
t2 --->>> 关于t1 --->>> 孙悟空

t2 --->>> 刘备t1 --->>> 猪八戒

t1 ended.t2 ended.

MainThread ended.

Method 2: create a thread class that inherits from threading.Thread, then override the __init__ and run methods:

import time, threading

class testThread(threading.Thread):

    def __init__(self, name, urls):
        threading.Thread.__init__(self, name=name)
        self.urls = urls

    def run(self):
        print("Current %s is running..." % threading.current_thread().name)
        for url in self.urls:
            print("%s --->>> %s" % (threading.current_thread().name, url))
            time.sleep(0.1)
        print("%s ended." % threading.current_thread().name)

if __name__ == '__main__':
    print("%s is running..." % threading.current_thread().name)
    t1 = testThread(name='t1', urls=['唐僧', '孙悟空', '猪八戒'])
    t2 = testThread(name='t2', urls=['张飞', '关于', '刘备'])
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    print("%s ended." % threading.current_thread().name)

The output is:

MainThread is running...
Current t1 is running...Current t2 is running...
t1 --->>> 唐僧

t2 --->>> 张飞
t1 --->>> 孙悟空t2 --->>> 关于

t1 --->>> 猪八戒t2 --->>> 刘备

t1 ended.t2 ended.

MainThread ended.

2.2 Thread synchronization

If multiple threads modify the same data at the same time, unpredictable results may occur. To keep the data correct, the threads need to be synchronized. Specifically:

  • Simple thread synchronization can be achieved with the Lock and RLock objects from the threading module; both provide acquire() and release() methods;
  • Operations on data that only one thread may touch at a time should be placed between acquire() and release();
  • With a Lock object, if a thread calls acquire() twice in a row, that thread deadlocks on itself;
  • An RLock object allows a thread to call acquire() several times in a row, because it tracks the number of acquire() calls with an internal counter;
  • Every acquire() must be matched by a corresponding release();
  • Only after all release() calls have been made can other threads acquire the RLock object.

import threading

test_lock = threading.RLock()
num = 0

class testThread(threading.Thread):
    def __init__(self, name):
        threading.Thread.__init__(self, name=name)

    def run(self):
        global num
        while True:
            test_lock.acquire()
            print("%s locked, Number: %d" % (threading.current_thread().name, num))
            if num >= 4:
                test_lock.release()
                print("%s released, Number: %d" % (threading.current_thread().name, num))
                break
            num += 1
            print("%s released, Number: %d" % (threading.current_thread().name, num))
            test_lock.release()

if __name__ == '__main__':
    t1 = testThread('孙悟空先上')
    t2 = testThread("终于到八戒了")
    t1.start()
    t2.start()

The output is:

孙悟空先上 locked, Number: 0
孙悟空先上 released, Number: 1
孙悟空先上 locked, Number: 1
孙悟空先上 released, Number: 2
孙悟空先上 locked, Number: 2
孙悟空先上 released, Number: 3
孙悟空先上 locked, Number: 3
孙悟空先上 released, Number: 4
孙悟空先上 locked, Number: 4
孙悟空先上 released, Number: 4终于到八戒了 locked, Number: 4

终于到八戒了 released, Number: 4
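
Since manual acquire()/release() pairs are easy to get wrong (for example, forgetting to release on an early return or an exception), Lock and RLock can also be used as context managers with the with statement. A minimal sketch of a counter protected this way (the loop bounds are arbitrary):

import threading

lock = threading.Lock()
counter = 0

def add_many(n):
    global counter
    for _ in range(n):
        with lock:      # acquire() on entry, release() on exit, even if an exception occurs
            counter += 1

if __name__ == '__main__':
    threads = [threading.Thread(target=add_many, args=(100000, )) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)      # 200000 every time, because the increments are protected by the lock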

2.3 Deadlock and recursive lock

1 Deadlock

A deadlock is a situation in which two or more processes or threads, while competing for resources during execution, end up waiting for each other, so that none of them can make progress without outside intervention. The system is then said to be in a deadlocked state, and the processes that wait for each other forever are called deadlocked processes. The example below produces a deadlock:

from threading import Thread,Lock
import time

mutexA = Lock()
mutexB = Lock()

class testThread(Thread):
    def run(self):
        self.func1()
        self.func2()

    def func1(self):
        # acquire lock A first, then lock B
        mutexA.acquire()
        print('%s got lock A' % self.name)

        mutexB.acquire()
        print('%s got lock B' % self.name)
        mutexB.release()
        mutexA.release()

    def func2(self):
        # acquire lock B first, then lock A (the opposite order makes deadlock possible)
        mutexB.acquire()
        print('%s got lock B' % self.name)
        time.sleep(2)

        mutexA.acquire()
        print('%s got lock A' % self.name)
        mutexA.release()

        mutexB.release()

if __name__ == '__main__':
    for i in range(5):
        t = testThread()
        t.start()

The output is:

Thread-1 got lock A
Thread-1 got lock B
Thread-1 got lock BThread-2 got lock A

Let's analyze how the code above produces a deadlock:
Five threads are started and each executes run(). Suppose Thread-1 grabs lock A first in func1. While it holds A, no other thread can get past mutexA.acquire(), so Thread-1 acquires lock B without competition, finishes func1, releases both locks and moves on to func2. In func2 it acquires lock B and then sleeps for two seconds while still holding it. Meanwhile, now that lock A has been released, the remaining threads race for it; suppose Thread-2 wins and then tries to acquire lock B inside func1. But lock B is held by the sleeping Thread-1, which, when it wakes up, will try to acquire lock A, which is held by Thread-2. Each thread is waiting for a lock the other one holds, so neither can proceed: a deadlock has formed.

2 Recursive lock

We analyzed the deadlock above, so how can it be avoided in Python? To allow the same thread to request the same resource multiple times, Python provides the reentrant lock RLock. Internally an RLock maintains a Lock and a counter; the counter records how many times acquire() has been called, so the owning thread can acquire the resource repeatedly. Only when that thread has released every one of its acquires can other threads obtain the lock. If RLock is used instead of Lock in the example above, no deadlock occurs:

from threading import Thread,RLock
import time

# A and B refer to the same reentrant lock, so one thread may acquire it repeatedly
mutexA = mutexB = RLock()

class testThread(Thread):
    def run(self):
        self.f1()
        self.f2()

    def f1(self):
        mutexA.acquire()
        print('%s got lock A' % self.name)
        mutexB.acquire()
        print('%s got lock B' % self.name)
        mutexB.release()
        mutexA.release()

    def f2(self):
        mutexB.acquire()
        print('%s got lock B' % self.name)
        time.sleep(0.1)
        mutexA.acquire()
        print('%s got lock A' % self.name)
        mutexA.release()
        mutexB.release()

if __name__ == '__main__':
    for i in range(5):
        t=testThread()
        t.start()

The output is:

Thread-1 got lock A
Thread-1 got lock B
Thread-1 got lock B
Thread-1 got lock A
Thread-2 got lock A
Thread-2 got lock B
Thread-2 got lock B
Thread-2 got lock A
Thread-4 got lock A
Thread-4 got lock B
Thread-4 got lock B
Thread-4 got lock A
Thread-3 got lock A
Thread-3 got lock B
Thread-3 got lock B
Thread-3 got lock A
Thread-5 got lock A
Thread-5 got lock B
Thread-5 got lock B
Thread-5 got lock A

A note on the recursive-lock code:
Since lock A and lock B are the same recursive lock, Thread-1 acquires it twice inside f1 (the internal counter records two acquires) and then releases it twice. Once Thread-1 has finished f1 and released the recursive lock, two things can happen next: either Thread-1 grabs the recursive lock again and runs f2, or another thread grabs it and runs f1. Either way, only the thread that currently holds the lock can proceed, so the circular wait from the previous example cannot occur.

Origin blog.csdn.net/qq_40716944/article/details/121510749