Artificial Intelligence: Asynchronous Computation

Asynchronous Computation

Today's computers are highly parallel systems, consisting of multiple CPU cores (often with multiple threads per core), multiple processing elements per GPU, and often multiple GPUs per device. In short, we can work on many different things at the same time, often on different devices. Unfortunately, Python is not well suited to writing parallel and asynchronous code, at least not without some extra help. After all, Python is single-threaded, and that is unlikely to change in the future. Deep learning frameworks such as MXNet and TensorFlow therefore adopt an asynchronous programming model to improve performance, while PyTorch uses Python's own scheduler, leading to a different performance trade-off. For PyTorch, GPU operations are asynchronous by default: when a function that uses the GPU is called, the operations are enqueued on the particular device but not necessarily executed until later. This allows us to execute more computations in parallel, including operations on the CPU or other GPUs.

Therefore, understanding how asynchronous programming works helps us develop more efficient programs by proactively reducing computational requirements and interdependencies. This reduces memory overhead and improves processor utilization.

import os
import subprocess
import numpy
import torch
from torch import nn
from d2l import torch as d2l
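
The d2l.Benchmark helper used below is a timing context manager that prints the elapsed wall-clock time of the enclosed block. If you want to run the snippets without the d2l package, a minimal stand-in could look like the following sketch (an assumption about its behavior, not the library's exact implementation):

import time

# Minimal Benchmark-style timer: prints the wall-clock time spent
# inside the `with` block, prefixed by a description.
class Benchmark:
    def __init__(self, description='Done'):
        self.description = description

    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, *args):
        print(f'{self.description}: {time.time() - self.start:.4f} sec')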

Asynchronous Processing via the Backend

As a warm-up, consider a simple problem: generating a random matrix and multiplying it by itself. Let's do this both in NumPy and in PyTorch tensors to see the difference.

# Warm up for GPU computation
device = d2l.try_gpu()
a = torch.randn(size=(1000, 1000), device=device)
b = torch.mm(a, a)

with d2l.Benchmark('numpy'):
    for _ in range(10):
        a = numpy.random.normal(size=(1000, 1000))
        b = numpy.dot(a, a)

with d2l.Benchmark('torch'):
    for _ in range(10):
        a = torch.randn(size=(1000, 1000), device=device)
        b = torch.mm(a, a)

numpy: 1.0704 sec
torch: 0.0013 sec

The benchmark output shows that PyTorch is orders of magnitude faster. The NumPy dot product is executed on the CPU, while the PyTorch matrix multiplication is executed on the GPU, so the latter is expected to be much faster. But the huge time difference suggests that something else must be going on. By default, GPU operations are asynchronous in PyTorch. Forcing PyTorch to finish all computation before returning shows what happened previously: the computation was executed by the backend while the frontend returned control to Python.

with d2l.Benchmark():
    for _ in range(10):
        a = torch.randn(size=(1000, 1000), device=device)
        b = torch.mm(a, a)
    torch.cuda.synchronize(device)

Done: 0.0049 sec
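
As an aside, any timing of GPU code needs an explicit synchronization point; otherwise we only measure how long it takes to enqueue the kernels. Besides torch.cuda.synchronize, CUDA events can time a region directly on the device. A short sketch (assuming a CUDA device is available):

# Timing with CUDA events (sketch): events are recorded on the GPU
# stream, so elapsed_time measures actual device execution rather than
# just the time the front end needs to enqueue the kernels.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(10):
    a = torch.randn(size=(1000, 1000), device=device)
    b = torch.mm(a, a)
end.record()
torch.cuda.synchronize(device)  # wait until both events have completed
print(f'events: {start.elapsed_time(end) / 1000:.4f} sec')  # ms -> sec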

Broadly speaking, PyTorch has a front end for direct interaction with users (e.g., via Python), as well as a back end used by the system to perform the computation. As shown in the figure, users can write PyTorch programs in various front-end languages, such as Python and C++. Regardless of the front-end programming language used, PyTorch programs are executed primarily on the back end, which is implemented in C++. Operations issued by the front-end language are passed to the back end for execution. The back end manages its own threads, which continuously collect and execute queued tasks. Note that for this to work, the back end must be able to keep track of the dependencies between the various steps in the computational graph. Hence, it is not possible to parallelize operations that depend on each other.

Let's look at another simple example to better understand the dependency graph.

x = torch.ones((1, 2), device=device)
y = torch.ones((1, 2), device=device)
z = x * y + 2
z

tensor([[3., 3.]], device='cuda:0')

The code snippet above is illustrated in the figure. Whenever the Python front-end thread executes one of the first three statements, it simply adds the task to the back end's queue. When the result of the last statement needs to be printed, the Python front-end thread waits for the C++ back-end thread to finish computing the result of the variable z. One benefit of this design is that the Python front-end thread does not need to perform the actual computation. Hence, regardless of Python's performance, it has little impact on the overall performance of the program. The diagram below demonstrates how the front end and back end interact.
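
We can make this blocking behavior visible by timing the two phases separately: enqueueing work returns almost immediately, whereas fetching the result (here by copying it to the CPU) has to wait for the back end to catch up. A small sketch along these lines (the tensor u, the sizes, and the loop counts are arbitrary choices for illustration):

# Sketch: the first block only measures how long it takes to enqueue the
# work; the second additionally waits for the result to be materialized.
u = torch.randn(size=(2000, 2000), device=device)

with d2l.Benchmark('enqueue only'):
    for _ in range(10):
        v = torch.mm(u, u)

torch.cuda.synchronize(device)  # drain the queue before the next measurement

with d2l.Benchmark('enqueue and fetch'):
    for _ in range(10):
        v = torch.mm(u, u)
    v_cpu = v.cpu()  # forces the front-end thread to wait for the back end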


Origin blog.csdn.net/weixin_43227851/article/details/134868616