Python multithreading, multiprocessing, and coroutines: a summary

Multithreading

True multithreading is scheduled across CPU cores. For example, a CPU-intensive program written in C and run on a quad-core processor can become up to 4 times faster with multithreading. The same program written in Python will not get faster and may even slow down, because multithreading in Python is governed by the GIL, the Global Interpreter Lock. To solve the problem of data integrity and state synchronization between threads, Python was designed so that only one thread can run in the interpreter at any time. Multithreading in Python is therefore only apparent (one thread executes at a time), not real parallel multithreading, and on top of the program's own work there is the extra time spent on thread switching. As a result, Python multithreading is better suited to I/O-intensive programs, while CPU-intensive programs that really need high efficiency are better written in C/C++.
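As a rough illustration of why threads still help with I/O-bound work, the sketch below uses time.sleep as a stand-in for a blocking I/O call (an assumption made for the example; the function name fake_io is hypothetical). The waits overlap because a sleeping thread does not hold the GIL.

```python
import threading
import time

def fake_io(delay):
    # time.sleep releases the GIL, so it stands in for a blocking I/O call
    time.sleep(delay)

start = time.perf_counter()
threads = [threading.Thread(target=fake_io, args=(0.2,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# Run one after another, the four waits would take about 0.8 s;
# with threads they overlap and the total stays close to a single wait
print(f"4 simulated I/O calls took {elapsed:.2f} s")
```

A CPU-bound function would not show the same gain, since only one thread can execute Python bytecode at a time.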

Implementing multithreading with threading

# encoding:utf-8
import threading

def thread_job():
    print("abc")
added_thread1 = threading.Thread(target=thread_job)
added_thread1.start()

Specify the function to run with target and the thread's name with name. To pass arguments to the function, set args, which must be a tuple such as (variable,):
threading.Thread(target=function, name="Thread", args=parameter)
Start the thread by calling start() on the thread object:
<thread>.start()
Block the calling thread (here the main thread) until the specified thread ends by calling join() on it:
<thread>.join()
View how many threads are currently alive:
threading.active_count()
View information about all currently alive threads:
threading.enumerate()
View which thread is currently running:
threading.current_thread()

# Get the current thread's name (currentThread() is a deprecated alias of current_thread())
t1 = threading.current_thread().name
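The calls listed above can be combined in a small sketch. The function name report and the thread name "worker-1" are made up for the example.

```python
import threading

seen = []

def report(label):
    # current_thread() returns the Thread object running this function
    seen.append((label, threading.current_thread().name))

# args must be a tuple, hence the trailing comma
worker = threading.Thread(target=report, name="worker-1", args=("hello",))
worker.start()
worker.join()

print(seen)                              # [('hello', 'worker-1')]
print(threading.active_count())          # number of threads still alive
print(threading.enumerate())             # list of the alive Thread objects
print(threading.current_thread().name)   # name of the thread running this line
```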

join blocks the main thread

Use join() to block the main thread until the specified thread ends, so that the main thread finishes last.

# encoding:utf-8
import threading
import time
def thread_first_job(x):
    time.sleep(0.1)
    print("This is the first thread ", x)

def thread_second_job(x):
    print("This is the second thread ", x)

first_thread = threading.Thread(target=thread_first_job, args=("Hi",))
second_thread = threading.Thread(target=thread_second_job, args=("Hello",))
ts=[first_thread,second_thread]
for t in ts:
    t.start()
# Wait for the threads to finish
for t in ts:
    t.join()
print('all done')

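setDaemon sets a daemon thread

If a thread should end together with the main thread regardless of whether it has finished executing, make it a daemon thread. setDaemon(True) must be called before start(); the default is False. Since Python 3.10, setDaemon() is deprecated in favor of setting the daemon attribute, for example thread.daemon = True.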

# encoding:utf-8
import threading
import time

def thread_first_job(x):
    time.sleep(5)
    print("This is the first thread ", x)

def thread_second_job(x):
    print("This is the second thread ", x)

first_thread = threading.Thread(target=thread_first_job, args=("Hi",))
second_thread = threading.Thread(target=thread_second_job, args=("Hello",))
first_thread.daemon = True  # same effect as the deprecated setDaemon(True)
first_thread.start()
second_thread.start()

# ---- output ---
# The main thread ends, so first_thread has to stop before it gets to print
# This is the second thread  Hello

Getting thread results with Queue

# encoding:utf-8
import threading
from queue import Queue

# Put the value to be returned into the Queue
def thread_job(data, q):
    for i in range(len(data)):
        data[i] = data[i] * 2
    q.put(data)

def multithread():
    data = [[1, 2, 3], [4, 5, 6]]
    q = Queue()
    all_thread = []

    # Use multiple threads
    for i in range(len(data)):
        thread = threading.Thread(target=thread_job, args=(data[i], q))
        thread.start()
        all_thread.append(thread)

    # Wait for all threads to finish
    for t in all_thread:
        t.join()
    # Use q.get() to take out the returned values
    result = []
    for _ in range(len(all_thread)):
        result.append(q.get())
    print(result)


multithread()
# ====== output ======
# [[2, 4, 6], [8, 10, 12]]

Lock thread synchronization

When several threads want to use the same data at the same time, wrap the access in lock.acquire() and lock.release() to avoid a race condition: while one thread holds the lock, the other threads must wait.

The main purpose of a Python mutex (Lock) is to protect shared resources and prevent dirty data when they are accessed concurrently.

First, look at the situation without locking:

# encoding:utf-8
import threading

def job1():
    global n
    for i in range(50):
        n+=1
        print('job1',n)

def job2():
    global n
    for i in range(50):
        n+=10
        print('job2',n)

n=0
t1=threading.Thread(target=job1)
t2=threading.Thread(target=job2)
t1.start()
t2.start()

Look at the execution result: the output of the two threads is interleaved.

Now look at the version with locking:

# encoding:utf-8
import threading

def job1():
    global n, lock
    # Acquire the lock
    lock.acquire()
    for i in range(10):
        n += 1
        print('job1', n)
    lock.release()


def job2():
    global n, lock
    # Acquire the lock
    lock.acquire()
    for i in range(10):
        n += 10
        print('job2', n)
    lock.release()

n = 0
# Create the lock object
lock = threading.Lock()

t1 = threading.Thread(target=job1)
t2 = threading.Thread(target=job2)
t1.start()
t2.start()

Since the job1 thread gets the lock first, no other thread can operate on n while its for loop runs. After job1 finishes and releases the lock, job2 gets the lock and starts its own for loop.
The execution results are as expected.
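The acquire()/release() pair can also be written with a with statement, which releases the lock even if the protected code raises. The following is a minimal sketch with a hypothetical counter and function name, not a rewrite of the example above.

```python
import threading

counter = 0
lock = threading.Lock()

def add(times):
    global counter
    for _ in range(times):
        # "with lock" acquires on entry and releases on exit, even on error
        with lock:
            counter += 1

threads = [threading.Thread(target=add, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

Here the lock is held only around the single update rather than around the whole loop, so the threads can take turns between increments.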

Semaphore

A Semaphore is similar to a Lock, but it has a counter that specifies how many threads are allowed to run at the same time.

# encoding:utf-8
# -*- coding:utf-8 -*-
import threading
import time

sem = threading.Semaphore(3)

class DemoThread(threading.Thread):

    def run(self):
        print('{0} is waiting semaphore.'.format(self.name))
        sem.acquire()
        print('{0} acquired semaphore({1}).'.format(self.name, time.ctime()))
        time.sleep(5)
        print('{0} release semaphore.'.format(self.name))
        sem.release()


if __name__ == '__main__':
    threads = []
    for i in range(4):
        threads.append(DemoThread(name='Thread-' + str(i)))

    for t in threads:
        t.start()

    for t in threads:
        t.join()

Running results: Thread-3 acquires the semaphore only after Thread-0 releases it.

Thread-0 is waiting semaphore.
Thread-0 acquired semaphore(Thu Oct 25 20:33:18 2018).
Thread-1 is waiting semaphore.
Thread-1 acquired semaphore(Thu Oct 25 20:33:18 2018).
Thread-2 is waiting semaphore.
Thread-2 acquired semaphore(Thu Oct 25 20:33:18 2018).
Thread-3 is waiting semaphore.
Thread-0 release semaphore.
Thread-3 acquired semaphore(Thu Oct 25 20:33:23 2018).
Thread-1 release semaphore.
Thread-2 release semaphore.
Thread-3 release semaphore.

 Source: https://www.jianshu.com/p/e52154188acc
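To check the limit rather than read it off the timestamps, one can count how many threads are inside the protected section at once. This is a small sketch under assumed names (worker, running, peak); the sleep simulates work.

```python
import threading
import time

sem = threading.Semaphore(3)
state_lock = threading.Lock()
running = 0
peak = 0

def worker():
    global running, peak
    with sem:                        # at most 3 threads get past this line
        with state_lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.05)             # simulated work
        with state_lock:
            running -= 1

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("peak concurrency:", peak)     # never more than 3
```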

RLock reentrant lock

RLock is similar to Lock, but RLock allows the same thread to acquire the lock repeatedly.

Use acquire() to obtain the RLock and release() to release it.

The ordinary lock is threading.Lock. If a thread that already holds it tries to acquire it again without releasing it first, the thread deadlocks.

One difference between regular locks and RLocks in Python is that a regular lock can be released by a different thread, while a reentrant lock must be released by the thread that acquired it, and it must be released as many times as it was acquired before another thread can use it. Also, avoid splitting the locking operation across threads: if a thread tries to release a lock that is not held, Python raises a RuntimeError.

Reentrant locks are generally used in object-oriented code where synchronized methods may be called in different orders or call one another; without a reentrant lock, such nested acquisition deadlocks. If f1 or f2 is not locked, the data is not synchronized and errors can occur.

# encoding:utf-8
import threading
class A:
   def f1(self):
       mutex.acquire()
       try:
            print('do something')
       finally:
           mutex.release()

   def f2(self):
       mutex.acquire()
       try:
           print('do something')
       finally:
           mutex.release()
           
def run1(obj):
    obj.f1()
    obj.f2()

def run2(obj):
    obj.f2()
    obj.f1()

obj1 = A()
mutex = threading.RLock()
t1 = threading.Thread(target=run1, args=(obj1, ))
t2 = threading.Thread(target=run2, args=(obj1, ))
t1.start()
t2.start()

Source: What are python reentrant locks good for? - H5W3
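The case where reentrancy actually matters is a locked method calling another locked method on the same object. The sketch below uses a hypothetical Account class to show it; it is an illustration, not taken from the source article.

```python
import threading

class Account:
    def __init__(self):
        self._lock = threading.RLock()
        self.balance = 0

    def deposit(self, amount):
        with self._lock:
            self.balance += amount

    def deposit_twice(self, amount):
        with self._lock:
            # deposit() re-acquires the lock this thread already holds;
            # with a plain threading.Lock this nested acquire would block forever
            self.deposit(amount)
            self.deposit(amount)

account = Account()
t = threading.Thread(target=account.deposit_twice, args=(5,))
t.start()
t.join()

print(account.balance)  # 10
```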

Condition

Condition can be understood as a communication mechanism between threads. If thread 1 needs data, it blocks and waits. Meanwhile thread 2 produces the data and, once it is ready, notifies thread 1, which then fetches the data.

A typical case is the following producer-consumer model:

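# encoding:utf-8

import threading
import time
con = threading.Condition()
meat_num = 0
def thread_consumers():
    # Acquire the condition variable's lock
    con.acquire()
    global meat_num
    # Wait for the meat slices to be put in the pot and cooked
    con.wait()
    while True:
        print("I'll take a slice of meat...")
        meat_num -= 1
        print("Slices left: %d" % meat_num)
        time.sleep(0.5)
        if meat_num == 0:
            # The meat is gone; ask the boss to add more
            print("Boss, another plate of meat slices, please...")
            con.notify()
            # The meat is gone; wait for more
            con.wait()

    # Release the condition variable's lock (never reached: the loop above does not end)
    con.release()


def thread_producer():
    # Acquire the condition variable's lock
    con.acquire()
    global meat_num
    # The meat is cooked and ready to eat
    meat_num = 10
    print("The meat is cooked, time to eat...")
    con.notify()
    while True:
        # Block until notified that the meat has been eaten
        con.wait()
        meat_num = 10
        # Meat added; eating can continue
        print("Meat added! Current number of slices: %d" % meat_num)
        time.sleep(1)
        con.notify()

    con.release()

if __name__ == '__main__':
    t1 = threading.Thread(target=thread_producer)
    t2 = threading.Thread(target=thread_consumers)
    # Start the threads; the order matters: the consumer must already be waiting when the producer calls notify()
    t2.start()
    t1.start()
    # Block the main thread until the child threads end
    t1.join()
    t2.join()
    # Never reached in practice, because both threads loop forever
    print("Program finished!")

'''
Output:

The meat is cooked, time to eat...
I'll take a slice of meat...
Slices left: 9
I'll take a slice of meat...
Slices left: 8
I'll take a slice of meat...
Slices left: 7
I'll take a slice of meat...
Slices left: 6
I'll take a slice of meat...
Slices left: 5
I'll take a slice of meat...
Slices left: 4
I'll take a slice of meat...
Slices left: 3
I'll take a slice of meat...
Slices left: 2
I'll take a slice of meat...
Slices left: 1
I'll take a slice of meat...
Slices left: 0
Boss, another plate of meat slices, please...
Meat added! Current number of slices: 10
I'll take a slice of meat...
Slices left: 9
I'll take a slice of meat...
Slices left: 8
I'll take a slice of meat...
Slices left: 7
.............
'''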

Source: https://www.jianshu.com/p/3f6ff092bf3c
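The example above depends on the consumer starting to wait before the producer notifies. A variation that does not depend on start order is to wait on a predicate with wait_for(). The sketch below is a finite version with assumed names (items, consumed), so that it terminates.

```python
import threading

cond = threading.Condition()
items = []
consumed = []

def consumer(expected):
    for _ in range(expected):
        with cond:
            # wait_for re-checks the predicate, so a notify() sent before
            # the consumer starts waiting is not lost
            cond.wait_for(lambda: len(items) > 0)
            consumed.append(items.pop(0))

def producer(values):
    for v in values:
        with cond:
            items.append(v)
            cond.notify()

c = threading.Thread(target=consumer, args=(3,))
p = threading.Thread(target=producer, args=([1, 2, 3],))
c.start()
p.start()
p.join()
c.join()

print(consumed)  # [1, 2, 3]
```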

Event

Event is used for communication between threads. One thread sets a signal; when the signal is True, the threads waiting on it are woken up. Event provides event.set() to set the signal, event.wait() to wait for it, and event.clear() to clear it.

# encoding:utf-8
import threading
import time
def thread_first_job():
    global a
    # The thread enters the waiting state
    print("Wait…")
    event.wait()

    for _ in range(3):
        a += 1
        print("This is the first thread ", a)
a = 0
# Create the event object
event = threading.Event()
first_thread = threading.Thread(target=thread_first_job)
first_thread.start()
time.sleep(3)
# Wake up the waiting thread
print("Wake up the thread…")
event.set()
first_thread.join()
# ====== output ======
# Wait...
# Wake up the thread...
# This is the first thread  1
# This is the first thread  2
# This is the first thread  3
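wait() also accepts a timeout and returns whether the signal was set, which is useful when a thread should not wait forever. A short sketch of set(), wait(timeout), and clear() follows; the use of threading.Timer to set the event later is a choice made for this example.

```python
import threading

event = threading.Event()

# Nobody has set the event yet, so wait() gives up after the timeout
first_wait = event.wait(timeout=0.1)        # returns False

# Set the event from another thread shortly afterwards
timer = threading.Timer(0.1, event.set)
timer.start()
second_wait = event.wait(timeout=5)         # returns True once set() has run
timer.join()

event.clear()                               # reset the signal to False
print(first_wait, second_wait, event.is_set())   # False True False
```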

Thread Pool

Thread pool usage scenarios
When a program needs to create a large number of short-lived threads, consider using a thread pool.

Principle of thread pool

A thread pool maintains a set of worker threads. When the program submits a function to the pool, an idle thread executes it. When the function finishes, the thread does not die; it returns to the pool and becomes idle again, waiting for the next function. In general a pool may create its threads at startup; Python's ThreadPoolExecutor creates them on demand, up to max_workers.

Benefit

A thread pool caps the number of concurrent threads, so a burst of tasks cannot exhaust system resources and crash the system.

Thread pool creation

concurrent.futures.Executor provides two subclasses
ThreadPoolExecutor: used to create a thread pool
ProcessPoolExecutor: used to create a process pool
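To make the thread reuse and the concurrency cap described above concrete, the sketch below runs 20 small tasks on a ThreadPoolExecutor with two workers and records which thread ran each one. The function name which_thread is hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def which_thread(_):
    # Report the name of the pool thread that ran this task
    return threading.current_thread().name

with ThreadPoolExecutor(max_workers=2) as pool:
    names = list(pool.map(which_thread, range(20)))

# 20 tasks were executed, but by at most 2 distinct threads
print(len(names), "tasks ran on", len(set(names)), "thread(s)")
```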

Pool methods

  • submit(fn, *args, **kwargs): Submit the fn function to the thread pool. *args are the positional arguments passed to fn, and **kwargs are the keyword arguments passed to fn.
  • map(func, *iterables, timeout=None, chunksize=1): Similar to the built-in map(func, *iterables), except that the calls to func are made asynchronously on the pool's threads and may run concurrently.
  • shutdown(wait=True): Close the thread pool. With the default wait=True, pool.shutdown() returns only after all submitted tasks have finished, and then the following statements run.

 pool.submit

# encoding:utf-8
from concurrent.futures import ThreadPoolExecutor
import threading
import time
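# Define a function to be used as the thread task
def action(max):
    my_sum = 0
    for i in range(max):
        print(threading.current_thread().name + '  ' + str(i))
        my_sum += i
    return my_sum
# Create a thread pool with 2 threads
pool = ThreadPoolExecutor(max_workers=2)
# Submit a task to the pool; 50 is passed as the argument of action()
future1 = pool.submit(action, 50)
# Submit another task to the pool; 100 is passed as the argument of action()
future2 = pool.submit(action, 100)
# Check whether the task represented by future1 has finished
print(future1.done())
time.sleep(3)
# Check whether the task represented by future2 has finished
print(future2.done())
# Get the result returned by the task represented by future1
print(future1.result())
# Get the result returned by the task represented by future2
print(future2.result())
# Shut down the thread pool
pool.shutdown()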

pool.map

map returns results in the order of its inputs, whereas tasks sent with submit finish in no guaranteed order.

If every task you want to submit uses the same function, the calls can be simplified to a single map.

This method works like the built-in map(), except that the thread pool's map() runs func on each element of iterables concurrently on the pool's worker threads and collects the results. The pool uses at most max_workers threads at a time; it does not start one new thread per element.

# encoding:utf-8
import datetime
import threading
from concurrent.futures import ThreadPoolExecutor
import time
def spider(page):
    time.sleep(page)
    print(f"crawl task{page} finished  thread_name " + threading.current_thread().name)
    return page

pool = ThreadPoolExecutor(max_workers=4)
print ("start processes...")
#for i in range(10):
#    t = pool.submit(spider, i)
t1=datetime.datetime.now()
pool.map(spider, [0,1,2,3])
pool.shutdown()
t2=datetime.datetime.now()
#time.sleep(20)
print("time cost==="+str(t2-t1))
print ("is ok?")
print ("all is ok!")
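The ordering guarantee of map can be seen by giving the tasks different durations. In this sketch (the names task and finish_order are assumptions), the shortest task finishes first, yet the results still come back in input order.

```python
import time
from concurrent.futures import ThreadPoolExecutor

finish_order = []

def task(delay):
    time.sleep(delay)
    finish_order.append(delay)   # records the order in which tasks finish
    return delay

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(task, [0.3, 0.1, 0.2]))

print(results)        # [0.3, 0.1, 0.2], the input order
print(finish_order)   # completion order, typically shortest delay first
```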

Future method 

The submit method will return a Future object, and the Future class is mainly used to obtain the return value of the thread task function.


Future provides the following methods

cancel(): Attempt to cancel the task represented by the Future. If the task is currently executing or has already finished and cannot be cancelled, the method returns False; otherwise the task is cancelled and the method returns True.
cancelled(): Return True if the task represented by the Future was successfully cancelled.
running(): Return True if the task represented by the Future is currently executing and cannot be cancelled.
done(): Return True if the task represented by the Future was successfully cancelled or has finished executing.
result(timeout=None): Return the value returned by the task represented by the Future. If the task has not finished, this method blocks the current thread; the timeout parameter specifies the maximum number of seconds to block.
exception(timeout=None): Return the exception raised by the task represented by the Future. If the task completed without raising, the method returns None.
add_done_callback(fn): Register a callback function for the task represented by the Future. When the task finishes or is cancelled, fn is called automatically with the Future as its only argument.
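A few of these methods can be tried on one task that succeeds and one that raises. The divide function below is a made-up example task.

```python
from concurrent.futures import ThreadPoolExecutor

def divide(a, b):
    return a / b

with ThreadPoolExecutor(max_workers=2) as pool:
    ok = pool.submit(divide, 6, 3)
    bad = pool.submit(divide, 1, 0)

# Leaving the with block waits for both tasks, so both futures are done here
print(ok.done(), ok.result())        # True 2.0
print(bad.done(), bad.exception())   # True, followed by the ZeroDivisionError instance
print(ok.cancel())                   # False: a finished task cannot be cancelled
```

Calling bad.result() instead of bad.exception() would re-raise the ZeroDivisionError in the calling thread.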
 

Future callback

If the program does not want to block the thread by calling result() directly, it can add a callback function through the Future's add_done_callback() method. The callback has the form fn(future). When the thread task completes, the program automatically triggers the callback and passes the corresponding Future object to it as the argument.

# encoding:utf-8
from concurrent.futures import ThreadPoolExecutor
import threading
import time

# Define a function to be used as the thread task
def action(max):
    my_sum = 0
    for i in range(max):
        # print(threading.current_thread().name + '  ' + str(i))
        my_sum += i
    return my_sum
# Create a thread pool with 2 threads
with ThreadPoolExecutor(max_workers=2) as pool:
    # Submit a task to the pool; 50 is passed as the argument of action()
    future1 = pool.submit(action, 50)
    # Submit another task to the pool; 100 is passed as the argument of action()
    future2 = pool.submit(action, 100)
    def get_result(future):
        print(future.result())
    # Add a completion callback to future1
    future1.add_done_callback(get_result)
    # Add a completion callback to future2
    future2.add_done_callback(get_result)
    print('--------------')

Sources:
https://www.jianshu.com/p/037219482691
"Python thread pool and its principle and use (super detailed)"

Multiprocessing

Python's threading package is the main tool for multithreaded development, but because of the GIL, multithreading in Python is not truly parallel. To make full use of a multi-core CPU, you need multiple processes in most cases. The multiprocessing package, introduced in Python 2.6, closely mirrors the interface of threading to make migration easy. The main difference is that it uses processes instead of threads. Each process has its own GIL, so there is no GIL contention between processes.

Process creates a process

Note: In Windows, Process() must be placed under if __name__=='__main__':

# encoding:utf-8
from multiprocessing import Process
import os
def run_proc(name):
    print('Run child process %s (%s)...' % (name, os.getpid()))
if __name__=='__main__':
    print('Parent process %s.' % os.getpid())
    p = Process(target=run_proc, args=('test',))
    print('Child process will start.')
    p.start()
    p.join()
    print('Child process end.')

multiprocessing module

The multiprocessing module is a cross-platform multi-process module. It encapsulates the fork() call so that we do not need to deal with the details of fork(). Since Windows has no fork call, multiprocessing has to "simulate" the effect of fork there.

Pool class (process pool)

The Pool class is intended for cases where many targets need to be executed and limiting the number of processes by hand would be too cumbersome. If there are only a few targets and the number of processes does not need to be controlled, the Process class is enough. Pool creates and manages a pool with a specified number of processes for the user to call.

pool.map

# encoding:utf-8
# -*- coding:utf-8 -*-
# Pool+map
import datetime
import os
from multiprocessing import Pool
from time import sleep
def test(i):
    print(i)
    sleep(1*5)
    print('Run child process  (%s)...' % (  os.getpid()))
if __name__ == "__main__":
    lists = range(30)
    pool = Pool(8)
    t1=datetime.datetime.now()
    pool.map(test, lists)
    pool.close()
    pool.join()
    t2 = datetime.datetime.now()
    print(f'cost==={t2-t1}')

pool.apply_async

# encoding:utf-8
# -*- coding:utf-8 -*-
# Asynchronous process pool (non-blocking)
from multiprocessing import Pool
def test(i):
    print(i)
if __name__ == "__main__":
    pool = Pool(8)
    for i in range(100):
        pool.apply_async(test, args=(i,))
    # The child processes run asynchronously with respect to the parent
    print("test")
    pool.close()
    pool.join()

pool.apply

# encoding:utf-8
# -*- coding:utf-8 -*-
# Synchronous process pool (blocking): apply() waits for each task to finish
from multiprocessing import Pool
def test(i):
    print(i)
if __name__ == "__main__":
    pool = Pool(8)
    for i in range(100):
        pool.apply(test, args=(i,))
    # The print below runs only after the tasks in the process pool have finished
    print("test")
    pool.close()
    pool.join()

Queue class (process resource sharing)

Used for inter-process communication and resource sharing.
When using multiple processes it is best to avoid shared resources. Ordinary global variables are not shared with child processes; only data structures built from multiprocessing components can be shared.

Queue is a class that creates a queue for passing data between processes. (Limitation: a multiprocessing.Queue can only be handed to Process objects; it cannot be passed as an argument to tasks in a Pool.)

Queue

# encoding:utf-8
from multiprocessing import Process, Queue
import os, time, random
def write(q):
    print('Process to write: %s' % os.getpid())
    for value in ['A', 'B', 'C']:
        print('Put %s to queue...' % value)
        q.put(value)
        time.sleep(random.random())

def read(q):
    print('Process to read: %s' % os.getpid())
    while True:
        value = q.get(True)
        print('Get %s from queue.' % value)

if __name__ == "__main__":
    q = Queue()
    pw = Process(target=write, args=(q,))
    pr = Process(target=read, args=(q,))
    pw.start()
    pr.start()
    pw.join()  # Wait for pw to finish
    pr.terminate()  # pr runs an infinite loop, so it cannot be joined and has to be terminated
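For the Pool limitation noted above, one common workaround is a queue created by a Manager, whose proxy object can be passed to pool workers. This is a minimal sketch, assuming the hypothetical helper names square_into and run; it is not part of the source article.

```python
from multiprocessing import Manager, Pool

def square_into(args):
    q, x = args
    q.put(x * x)      # the proxy forwards put() to the manager process

def run():
    with Manager() as manager:
        q = manager.Queue()           # a proxy object that can be pickled for Pool
        with Pool(2) as pool:
            pool.map(square_into, [(q, x) for x in range(5)])
        # Results arrive in completion order, so sort them for a stable view
        return sorted(q.get() for _ in range(5))

if __name__ == "__main__":
    print(run())  # [0, 1, 4, 9, 16]
```

Every call on the proxy goes through the manager's server process, so this is slower than a plain multiprocessing.Queue used with Process.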

JoinableQueue 

A JoinableQueue is like a Queue object, but the queue allows consumers of items to notify producers that items have been successfully processed. Notifying processes is implemented using shared signals and condition variables.

# encoding:utf-8
from multiprocessing import Process, JoinableQueue
import time, random
def consumer(q):
    while True:
        res = q.get()
        print('Consumer got %s' % res)
        q.task_done()
def producer(seq, q):
    for item in seq:
        time.sleep(random.randrange(1,2))
        q.put(item)
        print('Producer made %s' % item)
    q.join()
if __name__ == "__main__":
    q = JoinableQueue()
    seq = ('product %s' % i for i in range(5))
    p = Process(target=consumer, args=(q,))
    p.daemon = True  # Daemon process: p stops when the main process stops. This is safe because q.join() in producer ensures the consumer has processed every item in the queue
    p.start()
    producer(seq, q)
    print('Main process')

Value, Array (process resource sharing)

Used for inter-process communication and resource sharing. Note: Value and Array are only applicable to the Process class.
Value and Array in multiprocessing both work by creating ctypes objects in shared memory. The two are implemented similarly; they differ only in the ctypes data types they use.

# encoding:utf-8
import multiprocessing
def f(n, a):
    n.value = 3.14
    a[0] = 5
if __name__ == '__main__':
    num = multiprocessing.Value('d', 0.0)
    arr = multiprocessing.Array('i', range(10))
    p = multiprocessing.Process(target=f, args=(num, arr))
    p.start()
    p.join()
    print(num.value)
    print(arr[:])

Pipe (process resource sharing)

Used for pipe communication.
Besides Queue, multiple processes can also exchange data through a pipe.
Pipe() creates a pipe between processes and returns a tuple (conn1, conn2), where conn1 and conn2 are the connection objects at the two ends of the pipe. One point to emphasize: the pipe must be created before the Process object that uses it.

# encoding:utf-8
from multiprocessing import Process, Pipe
import time
# Function run by the child process
def f(Subconn):
    time.sleep(1)
    Subconn.send("Have you eaten?")
    print("Greeting from the parent:", Subconn.recv())
    Subconn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()  # Create both ends of the pipe
    p = Process(target=f, args=(child_conn,))  # Create the child process
    p.start()
    print("Greeting from the child:", parent_conn.recv())
    parent_conn.send("Yes")
    p.join()  # Wait for the child process to finish

 Manager (process resource sharing)

Used for resource sharing. Manager() returns a manager object that controls a server process. The Python objects held in that process can be accessed by other processes through proxies, which allows data to be shared between processes safely. Manager is often used together with Pool.

SyncManager provides the following types. Operations that combine several steps on them (for example, a read followed by a write) are not process-safe and need to be protected with a lock.

Array(self,*args,**kwds)
BoundedSemaphore(self,*args,**kwds)
Condition(self,*args,**kwds)
Event(self,*args,**kwds)
JoinableQueue(self,*args,**kwds)
Lock(self,*args,**kwds)
Namespace(self,*args,**kwds)
Pool(self,*args,**kwds)
Queue(self,*args,**kwds)
RLock(self,*args,**kwds)
Semaphore(self,*args,**kwds)
Value(self,*args,**kwds)
dict(self,*args,**kwds)
list(self,*args,**kwds)

# encoding:utf-8
import multiprocessing
def f(x, arr, l, d, n):
    x.value = 3.14
    arr[0] = 5
    l.append('Hello')
    d[1] = 2
    n.a = 10

if __name__ == '__main__':
    server = multiprocessing.Manager()
    x = server.Value('d', 0.0)
    arr = server.Array('i', range(10))
    l = server.list()
    d = server.dict()
    n = server.Namespace()
    proc = multiprocessing.Process(target=f, args=(x, arr, l, d, n))
    proc.start()
    proc.join()
    print(x.value)
    print(arr)
    print(l)
    print(d)
    print(n)

Process synchronization

Usage is similar to the threading versions, except that the classes come from multiprocessing:

Lock (mutual exclusion lock)
#from multiprocessing import Process, Lock
RLock (reentrant mutual exclusion lock; the same process can acquire it multiple times without blocking)
#from multiprocessing import Process, RLock
Semaphore (semaphore)
#from multiprocessing import Process, Semaphore
Condition (condition variable)
#import multiprocessing
Event (event)
#import multiprocessing

# encoding:utf-8
from multiprocessing import Process, Lock
def l(lock, num):
    lock.acquire()
    print("Hello Num: %s" % (num))
    lock.release()
if __name__ == '__main__':
    lock = Lock()  # Create one lock in the parent and pass the same object to every child
    for num in range(20):
        Process(target=l, args=(lock, num)).start()

Python concurrency with concurrent.futures

The Python standard library provides the threading and multiprocessing modules for writing multi-threaded and multi-process code. Starting from Python 3.2, the standard library also provides the concurrent.futures module. Its two classes, ThreadPoolExecutor and ProcessPoolExecutor, are higher-level abstractions over threading and multiprocessing and give direct support for writing thread pools and process pools.

ThreadPoolExecutor object
The ThreadPoolExecutor class is an Executor subclass that uses a thread pool to execute asynchronous calls.

class concurrent.futures.ThreadPoolExecutor(max_workers)

ProcessPoolExecutor object
The ProcessPoolExecutor class is an Executor subclass that uses a process pool to execute asynchronous calls.

class concurrent.futures.ProcessPoolExecutor(max_workers=None)

Usage

The following methods are used the same way as in the thread pool section above; just use the corresponding class from concurrent.futures:

submit() method
map() method
shutdown() method
Future

# encoding:utf-8
from concurrent import futures
def test():
    import time
    return time.ctime()

if __name__ == '__main__':
    with futures.ProcessPoolExecutor(max_workers=1) as executor:
        future = executor.submit(test)
        print(future.result())

Source: The most comprehensive arrangement of Python multi-threading and multi-processing- Know about

coroutine

Coroutine: also known as a micro-thread. Execution efficiency is high because switching between coroutines is a subroutine switch rather than a thread switch, so there is no thread-switching overhead; the more threads the alternative would need, the more obvious the performance advantage of coroutines. Shared resources inside coroutines do not need locks, because all coroutines run in one thread. To make full use of a multi-core CPU, combine multiple processes with coroutines.

Advantages

1. No overhead of thread context switching
2. No overhead of locking and synchronization for atomic operations
3. Convenient switching of control flow and a simplified programming model
4. High concurrency, high scalability, and low cost: a single CPU supporting tens of thousands of coroutines is not a problem, so coroutines are well suited to highly concurrent workloads.

An atomic operation is an operation that will not be interrupted by the thread scheduler: once it starts, it runs to the end without any context switch (a switch to another thread) in between.
An atomic operation can consist of one step or several steps, but their order cannot be disturbed and the operation cannot be cut off after only part of it has executed. Being treated as an indivisible whole is the core of atomicity.
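As an illustration of an operation that looks like one step but is not, the standard dis module can list the bytecode behind n += 1 on a global. The exact instruction names vary between Python versions, so the sketch only prints them; the function name increment is made up.

```python
import dis

n = 0

def increment():
    global n
    n += 1

# The single statement compiles to separate load, add, and store instructions
ops = [ins.opname for ins in dis.get_instructions(increment)]
print(ops)
```

Because the update is several instructions rather than one indivisible step, threads sharing n should guard it with a lock; whether a switch can actually land between the instructions depends on the interpreter version. Coroutines avoid the issue because they only switch at explicit points.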

Disadvantages

1. Unable to use multiple cores: a coroutine is essentially single-threaded, so it cannot use several cores of a CPU at the same time. Coroutines have to be combined with processes to run on multiple cores. Most everyday applications do not need this, unless they are CPU-intensive.
2. A blocking operation (such as blocking I/O) blocks the entire program.

Coroutine Basics

Python's support for coroutines is built on generators.
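A minimal sketch of a generator used as a coroutine is shown below: the consumer pauses at yield, and the producer resumes it with send(). The names consumer and produce are chosen for the example.

```python
def consumer():
    received = []
    while True:
        # Pause here and hand back a running count until send() delivers a value
        item = yield len(received)
        if item is None:
            return received
        received.append(item)

def produce(coro, items):
    next(coro)                        # advance the generator to its first yield
    counts = [coro.send(item) for item in items]
    try:
        coro.send(None)               # ask the consumer to finish
    except StopIteration as stop:
        return counts, stop.value     # the generator's return value

counts, received = produce(consumer(), ["a", "b", "c"])
print(counts)     # [1, 2, 3]
print(received)   # ['a', 'b', 'c']
```

Control moves back and forth inside a single thread, with no locks and no thread switch.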

Implementing coroutines with gevent

Based on greenlets

gevent is a concurrent networking library. Its coroutines are based on greenlet, and it implements a fast event loop on top of libev.

The basic idea is:
when a greenlet encounters an I/O operation, such as accessing the network, it automatically switches to another greenlet, and switches back at an appropriate time once the I/O operation has completed. Because I/O is very time-consuming, a program often spends its time waiting. With gevent switching coroutines automatically, there is always a greenlet running instead of waiting for I/O.

spawn builds new coroutines

# encoding:utf-8
import gevent
def foo():
    print('running in foo')
    gevent.sleep(2)  # simulate I/O
    print('come back from bar into foo')
    return 'foo'

def bar():
    print('running in bar')
    gevent.sleep(1)  # simulate I/O
    print('come back from foo into bar')
    return 'bar'

def func():
    print('in func of no io')
    return 'func'

def fund():
    print('in fund of no io')
    return 'fund'

jobs=[gevent.spawn(foo),gevent.spawn(bar),gevent.spawn(func),gevent.spawn(fund)]
gevent.joinall(jobs)
for job in jobs:
    print(job.value)  # values are printed in the order the jobs were spawned


monkey.patch_all

Patch blocking I/O in the standard library, and in third-party code built on it, so that it becomes non-blocking.

Since switching happens automatically during I/O operations, gevent needs to modify some of the standard libraries that ship with Python. This is done at startup through monkey patching.

Conditions for switching between coroutines:
gevent.sleep(): without patching, a greenlet only switches at explicit gevent calls such as gevent.sleep().

gevent's patch:
gevent.monkey.patch_all()  # After applying this patch you no longer need gevent.sleep(); time-consuming blocking calls switch tasks automatically.

# encoding:utf-8
from gevent import monkey; monkey.patch_socket()
import gevent
def f(n):
    for i in range(n):
        print (gevent.getcurrent(), i)
        gevent.sleep(0)

g1 = gevent.spawn(f, 5)
g2 = gevent.spawn(f, 5)
g3 = gevent.spawn(f, 5)
g1.join()
g2.join()
g3.join()

Multi-coroutine download pictures

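# encoding:utf-8
from gevent import monkey
# This line is only needed when the program does I/O; apply it before importing modules that use sockets
monkey.patch_all()  # Replace the time-consuming blocking calls used by the program with gevent's own implementations
import gevent
import urllib.request
def my_download(file_name, url):
    """
    Fetch data from the network and save it locally
    :param file_name: name of the file to save to
    :param url: URL to fetch
    :return:
    """
    print("Get : %s " % url)
    resp = urllib.request.urlopen(url)
    data = resp.read()
    with open(file_name, "wb") as f:
        f.write(data)
    print(file_name, "save ok, %d bytes received from %s." % (len(data), url))

def main():
    # Add all coroutine tasks and wait for each of them to finish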
    gevent.joinall([
        gevent.spawn(my_download, "1.jpg",
                     "https://ss3.bdstatic.com/70cFv8Sh_Q1YnxGkpoWK1HF6hhy/it/u=617100489,3848171650&fm=26&gp=0.jpg"),
        gevent.spawn(my_download, "2.jpg",
                     "https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fdik.img.kttpdq.com%2Fpic%2F3%2F1812%2F15cdcfba19bd15b4_1680x1050.jpg&refer=http%3A%2F%2Fdik.img.kttpdq.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=jpeg?sec=1615523263&t=8ef1c78524053592f5f8148b83def2c4"),
        gevent.spawn(my_download, "3.jpg",
                     "https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fdik.img.kttpdq.com%2Fpic%2F3%2F1630%2Fe5428d952b120906.jpg&refer=http%3A%2F%2Fdik.img.kttpdq.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=jpeg?sec=1615523263&t=0f8853c8f464a7f3641aadf3423bf75d")

    ])
if __name__ == '__main__':
    main()
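A download can fail, and `gevent.joinall` does not raise by default when a greenlet does. One way to tell the outcomes apart afterwards is to inspect each greenlet with `successful()`, `value` and `exception`. The sketch below uses a hypothetical `flaky` function in place of a real download, so it runs without network access.

```python
import gevent


def flaky(n):
    # Hypothetical stand-in for a download: odd inputs fail.
    if n % 2:
        raise ValueError("bad input %d" % n)
    return n * 10


def run_all(inputs):
    """Run flaky() on every input and split results from errors."""
    jobs = [gevent.spawn(flaky, n) for n in inputs]
    gevent.joinall(jobs)
    ok = [job.value for job in jobs if job.successful()]
    failed = [str(job.exception) for job in jobs if not job.successful()]
    return ok, failed


if __name__ == "__main__":
    print(run_all([0, 1, 2, 3]))
```

gevent also prints the traceback of each failed greenlet to stderr, so the failures are visible even if the caller never checks them.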

Coroutine pool

Control the number of coroutines through the coroutine pool

The image download above, rewritten to use a coroutine pool; the code is as follows

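# encoding:utf-8
from gevent import monkey
# Only needed when the program does I/O. Patch first, so that the blocking
# calls used below are replaced by gevent's own cooperative versions.
monkey.patch_all()

import gevent
import urllib.request
from gevent.pool import Pool

def my_download(file_name, url):
    """
    Fetch data from the network and save it locally.
    :param file_name: name of the file to save to
    :param url: URL to download
    :return: None
    """
    print("Get : %s " % url)
    resp = urllib.request.urlopen(url)
    data = resp.read()
    with open(file_name, "wb") as f:
        f.write(data)
    print(file_name, "save ok, %d bytes received from %s." % (len(data), url))

def main():
    # Spawn all coroutine tasks through the pool and wait for each of them to finish
    pool = Pool(5)  # at most 5 greenlets run at the same time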
    threads = [
        pool.spawn(my_download, "1.jpg",
                     "https://ss3.bdstatic.com/70cFv8Sh_Q1YnxGkpoWK1HF6hhy/it/u=617100489,3848171650&fm=26&gp=0.jpg"),
        pool.spawn(my_download, "2.jpg",
                     "https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fdik.img.kttpdq.com%2Fpic%2F3%2F1812%2F15cdcfba19bd15b4_1680x1050.jpg&refer=http%3A%2F%2Fdik.img.kttpdq.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=jpeg?sec=1615523263&t=8ef1c78524053592f5f8148b83def2c4"),
        pool.spawn(my_download, "3.jpg",
                     "https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fdik.img.kttpdq.com%2Fpic%2F3%2F1630%2Fe5428d952b120906.jpg&refer=http%3A%2F%2Fdik.img.kttpdq.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=jpeg?sec=1615523263&t=0f8853c8f464a7f3641aadf3423bf75d")
    ]
    gevent.joinall(threads)

if __name__ == '__main__':
    main()
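To see that the pool caps concurrency, one can count how many tasks are active at once. This is a small sketch with made-up names (`run_with_limit`, `task`), and `gevent.sleep` again stands in for I/O.

```python
import gevent
from gevent.pool import Pool


def run_with_limit(limit, n_tasks):
    """Run n_tasks through a Pool(limit) and report the peak concurrency."""
    state = {"active": 0, "peak": 0}

    def task(i):
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        gevent.sleep(0.01)  # simulated I/O; other greenlets may run here
        state["active"] -= 1
        return i * i

    pool = Pool(limit)
    # pool.spawn waits while the pool is full, so at most `limit` tasks overlap
    jobs = [pool.spawn(task, i) for i in range(n_tasks)]
    gevent.joinall(jobs)
    return [job.value for job in jobs], state["peak"]


if __name__ == "__main__":
    results, peak = run_with_limit(2, 6)
    print(results, peak)
```

With a limit of 2 and six tasks, the peak never exceeds 2, while the results still come back in spawn order.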

Multi-process + Coroutine

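Combining multiple processes with coroutines lets each process run on its own CPU core, while the coroutines inside a process switch among themselves without the cost of kernel-level thread or process switching. This approach can bring a large efficiency improvement for crawlers that handle large amounts of data and do a lot of file reading and writing.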

# encoding:utf-8
from gevent import monkey
monkey.patch_all()  # patch before importing requests so its sockets are cooperative

from multiprocessing import Process

import gevent
import requests


def fetch(url):
    """Fetch one page and return its body, or '' on failure."""
    try:
        with requests.Session() as s:
            r = s.get(url, timeout=1)  # fetch the page here
        return r.text
    except Exception as e:
        print(str(e))
        return ''


def process_start(url_list):
    # Inside each process, run one coroutine per URL
    tasks = [gevent.spawn(fetch, url) for url in url_list]
    gevent.joinall(tasks)


def task_start(filepath, flag=100000):  # start one process per 100,000 URLs
    processes = []
    url_list = []  # URLs collected for the next process
    with open(filepath, 'r') as reader:  # read URLs from the given file, one per line
        for line in reader:
            url = line.strip()
            if not url:
                continue  # skip blank lines
            url_list.append(url)
            if len(url_list) == flag:  # batch is full: hand it to a new process
                p = Process(target=process_start, args=(url_list,))
                p.start()
                processes.append(p)
                url_list = []  # start a new batch
    if url_list:  # leftover URLs go to one final process
        p = Process(target=process_start, args=(url_list,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()  # wait for all worker processes to finish


if __name__ == '__main__':
    task_start('./testData.txt')  # read the given file
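The batching step in `task_start` (collect a fixed number of URLs, then hand them to a process) can be isolated and tried on its own. The generator below is a sketch using only the standard library; the name `batched_lines` is hypothetical and it starts no processes itself.

```python
from itertools import islice


def batched_lines(lines, batch_size):
    """Yield lists of up to batch_size stripped, non-empty lines."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    stripped = (line.strip() for line in lines)
    non_empty = (line for line in stripped if line)
    while True:
        batch = list(islice(non_empty, batch_size))
        if not batch:
            return  # input exhausted
        yield batch


if __name__ == "__main__":
    sample = ["http://a.example\n", "http://b.example\n", "\n", "http://c.example\n"]
    for batch in batched_lines(sample, 2):
        print(batch)
```

Each yielded batch could then be passed as the `args` of one `Process`, and because any iterable of lines is accepted, an open file object works as input too.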

source:

https://www.jianshu.com/p/77e565a802c2

[LemonCK] Advantages and disadvantages of Python coroutine concurrency - charseki - 博客园

Python3 limits the number of concurrent coroutines through gevent.pool - chengd - 博客园

The use of multi-process + coroutine in python and why to use it - L Yu's Blog - CSDN Blog

Origin blog.csdn.net/csdncjh/article/details/127328570