PyTorch: speeding up training and inference by making better use of the CPU (1)

Goal: with PyTorch, whether training or inference, CPU utilization is basically only about 10%. The aim is to make better use of the CPU to speed up training and inference.

1. While researching this, I found that many authors mention Python's GIL problem. Let's first understand that mechanism, starting directly with examples.

This machine's CPU is an Intel i5-4460 (4 cores, 4 threads).

First, refer to the clear description in Di Sheng's handwritten notes. This experiment uses Python 3.5, so the code was adapted accordingly. (Python 3.2 introduced the new GIL: in the new implementation, when another thread requests the lock, the current thread is forced to release it after 5 ms.)

GIL lock release mechanism:

 Multithreading inside the Python interpreter process is cooperative multitasking. When a thread encounters an I/O task, it releases the GIL. A CPU-bound thread releases the GIL after executing roughly 100 ticks of the interpreter, where a tick can be loosely regarded as one Python virtual-machine instruction. Ticks have nothing to do with the length of a time slice; the check interval can be set via sys.setcheckinterval().
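Since this experiment runs on Python 3.5, note that the tick mechanism above describes the old GIL: from Python 3.2 onward the check is time-based, and sys.setcheckinterval() is deprecated in favor of sys.setswitchinterval(). A minimal illustration:

import sys

# Python 3.2+ replaced the tick-based check interval with a
# time-based switch interval; the default is 5 ms.
print(sys.getswitchinterval())   # 0.005 on a default build

# Offer the GIL to other threads more often (this also means more
# switching overhead for CPU-bound threads).
sys.setswitchinterval(0.001)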

 A1. Running the program with a single thread takes 71.47 s.

import time


def counter1():
    # pure CPU-bound loop: 300 million increments, no I/O
    for i in range(300000000):
        i = i + 1
    print("this is i:", i + 5)


def counter2():
    for j in range(300000000):
        j = j + 1
    print("this is j:", j + 10)


def main():
    start_time = time.time()
    for x in range(2):
        counter2()
        counter1()

    end_time = time.time()
    print("Total time: {}".format(end_time - start_time))


if __name__ == '__main__':
    main()


this is j: 300000010
this is i: 300000005
this is j: 300000010
this is i: 300000005
Total time: 71.47001194953918

 A2. Running the same program with multiple threads takes 72.08 s.

from threading import Thread
import time


def counter1():
    for i in range(300000000):
        i = i + 1
    print("this is i:", i + 5)


def counter2():
    for j in range(300000000):
        j = j + 1
    print("this is j:", j + 10)


def main():
    start_time = time.time()
    for x in range(2):
        t1 = Thread(target=counter2)
        t2 = Thread(target=counter1)
        t1.start()
        t2.start()
        # note: only t2 is joined; the last t1 may still be running when
        # the total time prints (hence the output order below)
        t2.join()

    end_time = time.time()
    print("Total time: {}".format(end_time - start_time))


if __name__ == '__main__':
    main()


this is i: 300000005
this is j: 300000010
this is i: 300000005
Total time: 72.07586812973022
this is j: 300000010

Clearly, these two cases show that in Python (CPython) the same CPU-bound program runs faster single-threaded than multithreaded. Because of the GIL, the threads must constantly contend for the lock and switch between each other, and on multi-core CPUs this contention is especially severe (thread thrashing).

B1. Run the same program single-threaded again, but with the loop count reduced to 1000 and a time-consuming time.sleep(0.01) call added to each iteration. Single-threaded execution now takes 42.10 s.

import time


def counter1():
    for i in range(1000):
        i = i + 1
        time.sleep(0.01)  # I/O-like wait: the thread releases the GIL here
    print("this is i:", i + 5)


def counter2():
    for j in range(1000):
        j = j + 1
        time.sleep(0.01)
    print("this is j:", j + 10)


def main():
    start_time = time.time()
    for x in range(2):
        counter2()
        counter1()

    end_time = time.time()
    print("Total time: {}".format(end_time - start_time))


if __name__ == '__main__':
    main()
 

this is j: 1010
this is i: 1005
this is j: 1010
this is i: 1005
Total time: 42.09901976585388

B2. Run the same sleep-modified program with multiple threads. This time it takes only 22.00 s.

from threading import Thread
import time


def counter1():
    for i in range(1000):
        i = i + 1
        time.sleep(0.01)
    print("this is i:", i + 5)


def counter2():
    for j in range(1000):
        j = j + 1
        time.sleep(0.01)
    print("this is j:", j + 10)


def main():
    start_time = time.time()
    for x in range(2):
        t1 = Thread(target=counter1)
        t2 = Thread(target=counter2)
        t1.start()
        t2.start()
        t2.join()

    end_time = time.time()
    print("Total time: {}".format(end_time - start_time))


if __name__ == '__main__':
    main()



this is j: 1010
this is i: 1005
this is i: 1005
this is j: 1010
Total time: 22.006017684936523

Why is the multithreaded version of the same program now faster than the single-threaded one, once the time-consuming sleep is added? Doesn't this contradict the result above? It is exactly the GIL release mechanism described earlier: when a thread encounters an I/O task it releases the GIL, while a CPU-bound thread only releases it after roughly 100 ticks. Adding sleep effectively turns a compute-bound program into an I/O-bound one that spends most of its time waiting. Rather than holding the GIL through the wait, each thread releases the lock immediately and another thread runs in the meantime. In this case multithreading is far more efficient than a single thread, which must sit through every sleep before continuing.

 

2. Some people also say that multithreading in Python is fake multithreading. The following draws on DarrenChan Chen Chi's answer to "Why do some people say that Python's multithreading is useless?".

Before introducing Python threads, let's clarify one thing: multithreading in Python is fake multithreading! To see why, we first need the concept of the global interpreter lock (GIL).

The execution of Python code is controlled by the Python virtual machine (the interpreter). Python was designed so that at any moment only one thread executes in the interpreter's main loop, just like a single-CPU system running multiple processes: many programs can sit in memory, but at any instant only one of them runs on the CPU. Similarly, although the Python interpreter can host multiple threads, only one thread ever runs in the interpreter at a time.

Access to the Python virtual machine is controlled by the global interpreter lock (GIL), which ensures that only one thread runs at a time. In a multithreaded environment, the Python virtual machine executes as follows:

1. Acquire the GIL.

2. Switch in a thread to run.

3. Execute.

4. Put the thread back to sleep.

5. Release the GIL.

6. Repeat steps 1–5.

For all I/O-oriented programs (those that call built-in operating-system C code), the GIL is released before the I/O call, allowing other threads to run while this thread waits on I/O. A thread that performs little I/O will instead hold the processor, and the GIL, for its whole time slice. In other words, I/O-bound Python programs can exploit multithreading far better than compute-bound ones.

We all know that on, say, a 4-core CPU, each core can run one thread at a time, with time slices rotating among the threads. Python is different: no matter how many cores you have, only one thread runs at a time across all of them, and the time slices still rotate. Sounds incredible? That is the GIL at work. Before any Python thread executes, it must first acquire the GIL; then, roughly every 100 bytecode instructions (under the old, tick-based GIL), the interpreter releases the lock to give other threads a chance to execute. The GIL effectively serializes the execution of all threads' code, so threads in Python can only run alternately: even 100 threads on a 100-core CPU would use just one core. The interpreter we normally use is the official CPython implementation; to truly exploit multiple cores we would need an interpreter without a GIL.
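To get a concrete feel for what a bytecode instruction is here, the standard dis module shows what a simple function compiles to (a quick illustration; the exact opcode names vary across Python versions):

import dis

def add_one(i):
    return i + 1

# prints the function's bytecode, e.g. LOAD_FAST, LOAD_CONST,
# BINARY_ADD (BINARY_OP on newer versions), RETURN_VALUE
dis.dis(add_one)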

First, Python multithreading:

# coding=utf-8
from threading import Thread


def loop():
    while True:
        pass

if __name__ == '__main__':

    # 3 worker threads plus the main thread's own busy loop: 4 CPU-bound loops
    for i in range(3):
        t = Thread(target=loop)
        t.start()

    while True:
        pass

CPU usage stays at only about 30%.

 

Now Python multiprocessing:

# coding=utf-8
from multiprocessing import Process


def loop():
    while True:
        pass

if __name__ == '__main__':

    # 3 worker processes plus the main process's busy loop: 4 CPU-bound loops
    for i in range(3):
        p = Process(target=loop)
        p.start()

    while True:
        pass

CPU usage jumps straight to 100%. The comparison shows that multiple processes really do use multiple cores.
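For completeness, the multiprocessing.Pool imported (but unused) above is the more idiomatic way to spread CPU-bound work across cores. A minimal sketch, with square() as a stand-in for any compute-bound function:

from multiprocessing import Pool, cpu_count

def square(n):
    # stand-in for any CPU-bound work
    return n * n

if __name__ == '__main__':
    # one worker process per core; each has its own interpreter and GIL
    with Pool(processes=cpu_count()) as pool:
        print(pool.map(square, range(10)))  # [0, 1, 4, ..., 81]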

 

Following this logic, here is B3, a multi-process analogue of B2 from point 1. It takes 21.89 s, versus the earlier 22.00 s with threads. There is an effect, but a small one, presumably because the sleeps are so long that a single core has ample time to switch back and forth and do the work of two cores.

import time
from multiprocessing import Process

def counter1():
    for i in range(1000):
        i = i + 1
        time.sleep(0.01)
    print("this is i:", i + 5)


def counter2():
    for j in range(1000):
        j = j + 1
        time.sleep(0.01)
    print("this is j:", j + 10)


def main():
    start_time = time.time()
    for x in range(2):
        t1 = Process(target=counter1)
        t2 = Process(target=counter2)
        t1.start()
        t2.start()
        t2.join()

    end_time = time.time()
    print("Total time: {}".format(end_time - start_time))


if __name__ == '__main__':
    main()


this is i: 1005
this is j: 1010
this is j: 1010
this is i: 1005
Total time: 21.886003255844116

Following the same logic, I set sleep to 0.0001: multithreading 4.01 s vs. multiprocessing 3.71 s. The gap is still small, but more visible than before. (After many attempts I found that for any sleep between 0.001 and 0.000000001 the times stay around 4 s vs. 3.7 s, so some bottleneck seems to be in play. A plausible culprit is OS timer granularity: sleeps shorter than the scheduler's resolution get rounded up, so shrinking the argument further stops shortening the actual wait.)
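One way to probe the suspected bottleneck is to measure what time.sleep() actually delivers for very small arguments; below some OS-dependent granularity, a shorter request no longer shortens the real wait. A sketch:

import time

for requested in (1e-3, 1e-5, 1e-7, 1e-9):
    start = time.perf_counter()
    for _ in range(1000):
        time.sleep(requested)
    avg = (time.perf_counter() - start) / 1000
    print("requested {:g}s, actual avg {:.6f}s per sleep".format(requested, avg))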

I then varied the loop count:

- Loop count in B2 and B3 set to 100, sleep still 0.0001: multithreading 0.40 s vs. multiprocessing 1.08 s.

- Loop count set to 10000, sleep still 0.0001: multithreading 40.01 s vs. multiprocessing 29.94 s.

- Finally, at the other extreme, loop count set to 5 with sleep increased to 0.1: multithreading 1.01 s vs. multiprocessing 1.82 s.

As explained in point 1, the sleep call effectively turns a compute-bound program into an I/O-bound one that spends its time waiting, while the loop count controls the compute-bound share. These comparisons show that B2 and B3 mix I/O-bound and compute-bound work, and the two interact: with few iterations, the fixed cost of starting workers dominates and threads win; with many iterations, the parallel speedup dominates and processes win. A naive one-number comparison is therefore misleading.
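The fixed cost that dominates at low iteration counts is worker start-up: creating a process is far more expensive than creating a thread. A small benchmark sketch (numbers will vary by machine):

import time
from threading import Thread
from multiprocessing import Process

def noop():
    pass

def bench(worker_cls, n=100):
    # time how long it takes to start and join n do-nothing workers
    start = time.perf_counter()
    workers = [worker_cls(target=noop) for _ in range(n)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.perf_counter() - start

if __name__ == '__main__':
    print("threads:   {:.3f}s".format(bench(Thread)))
    print("processes: {:.3f}s".format(bench(Process)))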


Since the example above (A2) does not play to multithreading's strengths, I modified it into a multi-process version, A3. This time the program takes 36.63 s, roughly half of the previous 72.08 s.

import time
from multiprocessing import Process

def counter1():
    for i in range(300000000):
        i = i + 1
    print("this is i:", i + 5)


def counter2():
    for j in range(300000000):
        j = j + 1
    print("this is j:", j + 10)


def main():
    start_time = time.time()
    for x in range(2):
        t1 = Process(target=counter2)
        t2 = Process(target=counter1)
        t1.start()
        t2.start()
        t2.join()

    end_time = time.time()
    print("Total time: {}".format(end_time - start_time))


if __name__ == '__main__':
    main()


this is j: 300000010
this is i: 300000005
this is i: 300000005
this is j: 300000010
Total time: 36.62899208068848

A1, A2, and A3 are compute-bound programs. Comparing A2 and A3 shows that for compute-bound work, multi-core multiprocessing beats single-core multithreading (36.63 s vs. 72.08 s).

 

To compare the two models on a purely I/O-bound workload, experiment C strips B2 and B3 down to bare sleep calls and runs both versions:

from threading import Thread
import time
from multiprocessing import Process

def counter1():
    time.sleep(0.1)


def counter2():
    time.sleep(0.1)



def main_Thread():
    start_time = time.time()
    for x in range(100):
        t1 = Thread(target=counter1)
        t2 = Thread(target=counter2)
        t1.start()
        t2.start()
        t2.join()

    end_time = time.time()
    print("Thread Total time: {}".format(end_time - start_time))
def main_Process():
    start_time = time.time()
    for x in range(100):
        t1 = Process(target=counter1)
        t2 = Process(target=counter2)
        t1.start()
        t2.start()
        t2.join()

    end_time = time.time()
    print("Process Total time: {}".format(end_time - start_time))

if __name__ == '__main__':
    main_Thread()
    main_Process()


Thread Total time: 10.126013040542603
Process Total time: 49.22399544715881

Experiment C shows that for this I/O-bound workload, multithreading is much faster than multiprocessing (10.13 s vs. 49.22 s): the sleeps release the GIL anyway, and starting threads is far cheaper than starting processes.
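In practice, the standard library's concurrent.futures wraps both models behind one interface, so the executor can be chosen to match the workload: threads for I/O-bound tasks, processes for compute-bound ones. A minimal sketch:

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def io_task(_):
    time.sleep(0.1)  # I/O-bound: the GIL is released while sleeping

def cpu_task(_):
    return sum(i * i for i in range(10 ** 6))  # compute-bound: holds the GIL

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=4) as ex:    # suits io_task
        list(ex.map(io_task, range(8)))
    with ProcessPoolExecutor(max_workers=4) as ex:   # suits cpu_task
        list(ex.map(cpu_task, range(8)))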

Origin: blog.csdn.net/qq_36401512/article/details/113105009