Move the cast earlier! The simplest and most effective neural network optimization method, bar none!

Optimization can be a real headache sometimes. I rack my brains over how to keep the algorithm equivalent and how to streamline the instructions in every layer of the neural network, so that the whole network keeps its accuracy while still delivering high performance.

But sometimes, after grinding away for a long time, I find the pipeline just won't flow at all; it keeps stalling for no obvious reason.

A flurry of operations, fierce as a tiger; then I look back at the results and find I haven't moved an inch.

Today I want to introduce a performance optimization method for neural networks. It requires no deep algorithmic knowledge, yet it can improve performance at every level, from the entire network down to individual operators.

And it delivers a multiplicative performance improvement, while the algorithmic equivalence is obvious.

How? It's very simple: just change the order in which operators are called.

Let me talk about the background first.

When doing AI inference or training, in most cases all the layers in a neural network compute with the same data type.

For example, to improve the recognition accuracy of the network, its operations can use a high-precision floating-point type such as float32, FP32 for short.

But sometimes, for the sake of performance, losing a little recognition accuracy is acceptable. In that case the half-precision type float16, FP16 for short, may be used for the computation instead.

The difference between FP32 and FP16 is that the former's bit width is twice the latter's, so when representing the same data FP32 is more precise, but it also takes up more memory.

For example, storing the same image might take 1 MB of memory in FP32, but only 0.5 MB in FP16.
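As a quick sanity check of those numbers, here is a small NumPy snippet; the 512×512 single-channel image size is my own assumption, chosen so the byte counts line up with the 1 MB / 0.5 MB figures above:

```python
import numpy as np

# A hypothetical 512x512 single-channel image.
img_fp32 = np.zeros((512, 512), dtype=np.float32)
img_fp16 = img_fp32.astype(np.float16)

print(img_fp32.nbytes)  # 1048576 bytes -> 1 MiB
print(img_fp16.nbytes)  # 524288 bytes  -> 0.5 MiB
```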

You may have heard of mixed-precision inference and mixed-precision training. The "mixed" here refers to mixing precisions, meaning that a single neural network contains more than one data type.

Why is mixed-precision inference or training possible at all?

A neural network is like a building, built up from layers of algorithms, and the algorithm in each layer may be different. Different algorithms have different sensitivities to data precision.

Many algorithms are not sensitive to data precision at all, such as transpose, gather, scatter, and so on. These are all data-movement operations, that is, pure IO operations: they perform no arithmetic on the data and never have to worry about overflow when values are added up.

Some algorithms, however, are very sensitive to data precision. A typical example is conv2d, which performs a huge number of multiply-accumulate operations; the accumulation is prone to overflow, so a higher-bit-width type is needed to hold the accumulated result.
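Here is a tiny NumPy illustration of both points; the values are made up, and NumPy may print an overflow warning for the FP16 accumulator:

```python
import numpy as np

# Pure data movement: transpose never touches the values, so the dtype is irrelevant.
x = np.arange(6, dtype=np.float16).reshape(2, 3)
assert np.array_equal(x.T.T, x)

# Multiply-accumulate style work: a long FP16 accumulation overflows past ~65504,
# while an FP32 accumulator holds the same sum with no trouble.
acc16, acc32 = np.float16(0.0), np.float32(0.0)
for _ in range(1000):
    acc16 = np.float16(acc16 + np.float16(100.0))  # drifts, then overflows to inf
    acc32 = acc32 + np.float32(100.0)
print(acc16, acc32)  # inf 100000.0
```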

If working in FP32 is like having to carry 32 bricks, then FP16 only needs to carry 16. Obviously, all else being equal, carrying 16 bricks saves time and effort compared with carrying 32.

Therefore, in a neural network, and especially in a mixed-precision training or inference network, if you come across data-movement operators that are hauling FP32 around, there is a good chance you can make them carry only 16 bricks (FP16) instead.

So how do we do it?

First, let's simplify. Assume a neural network with the following structure: conv2d → transpose → cast → (next layer).

In this imaginary network, the convolutional layer (conv2d) produces FP32 output, which is then sent to the transpose layer for data movement. Since transpose is a pure IO operation, its output is also FP32.

The output of transpose is fed to the next layer, cast, which is responsible for converting the FP32 data into FP16, so the output of cast is FP16. The FP16 data is then sent on to the next layer for computation.
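A minimal PyTorch-style sketch of such a forward pass might look like this; the layer sizes and the `Before` class name are made up purely for illustration:

```python
import torch
import torch.nn as nn

class Before(nn.Module):
    """conv2d -> transpose -> cast, as described above."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3)  # computes in FP32

    def forward(self, x):
        y = self.conv(x)           # FP32 output
        y = y.permute(0, 2, 3, 1)  # transpose: moves FP32 data, 4 bytes per element
        y = y.half()               # cast: FP32 -> FP16
        return y                   # FP16 goes on to the next layer
```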

Have you noticed that in this network, transpose first hauls the FP32 data around, and only afterwards hands it to cast for data type conversion down to the lower-bit-width FP16?

However, since transpose is a pure IO operation and is not sensitive to the data type, we can simply move the cast operator in front of transpose. Then transpose only has to move FP16 data.

The converted network becomes: conv2d → cast → transpose → (next layer).
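Continuing the hypothetical sketch from above (again, an illustration rather than the post's original code), the only change is that the cast now comes before the transpose:

```python
class After(nn.Module):
    """conv2d -> cast -> transpose: the cast has been moved earlier."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3)  # still computes in FP32

    def forward(self, x):
        y = self.conv(x)           # FP32 output
        y = y.half()               # cast first: FP32 -> FP16
        y = y.permute(0, 2, 3, 1)  # transpose now moves FP16, 2 bytes per element
        return y
```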

The result: the computation of the whole network is still equivalent, but transpose now moves FP16 data instead of FP32. For transpose, that halves the bytes moved, so its IO performance is doubled.
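If you want to convince yourself of the equivalence, one way with the toy modules above is to give both versions identical weights and compare their outputs. Cast is elementwise and transpose only reorders elements, so the two orderings should match exactly:

```python
torch.manual_seed(0); a = Before()
torch.manual_seed(0); b = After()   # same seed -> same conv weights

x = torch.randn(1, 3, 32, 32)
print(torch.equal(a(x), b(x)))      # True: the reordering is exactly equivalent
```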

This is just a very simple example.

In fact, in a real network, what this method ends up optimizing is sometimes not just a single transpose but a whole segment of the network.

As you can see, the simple act of moving the cast earlier can double the performance of the entire network.

This method is simple, effective, and easy to implement, yet in real-world network optimization it is sometimes overlooked.

A network that can use this optimization must meet the following two conditions:

  • It must be a mixed-precision network

  • A pure-IO operator must sit in front of a cast that converts from a high bit width to a low bit width (see the toy rewrite sketch after this list)
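To make these two conditions concrete, here is a toy, framework-agnostic sketch of a pass that spots a pure-IO operator feeding a narrowing cast and swaps the two. The list-of-dicts IR is my own invention for illustration, not any real compiler API:

```python
IO_OPS = {"transpose", "gather", "scatter"}

def hoist_downcasts(graph):
    """graph: list of {'op': name, 'bits': output bit width}, in execution order.
    Moves a narrowing cast in front of an adjacent pure-IO op so that the IO op
    only has to move the narrower data type."""
    g = [dict(node) for node in graph]  # work on a copy
    i = 0
    while i < len(g) - 1:
        cur, nxt = g[i], g[i + 1]
        narrowing_cast = nxt["op"] == "cast" and nxt["bits"] < cur["bits"]
        if cur["op"] in IO_OPS and narrowing_cast:
            g[i], g[i + 1] = nxt, cur       # swap: the cast now runs first
            g[i + 1]["bits"] = nxt["bits"]  # the IO op now outputs the low bit width
            i = 0                           # rescan: the cast may hoist further up
        else:
            i += 1
    return g

before = [{"op": "conv2d", "bits": 32}, {"op": "transpose", "bits": 32},
          {"op": "cast", "bits": 16},   {"op": "matmul", "bits": 16}]
print([(n["op"], n["bits"]) for n in hoist_downcasts(before)])
# [('conv2d', 32), ('cast', 16), ('transpose', 16), ('matmul', 16)]
```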

While we rack our brains over advanced techniques such as model parallelism and layer-by-layer pipelining to optimize a network, it is worth zooming out, looking at the whole picture, and checking whether the network meets the two conditions above. Sometimes you can spot it at a single glance. This is the simplest and most effective optimization point there is, and from now on, boosting network performance by a healthy percentage is not a dream!

Original post: blog.csdn.net/dongtuoc/article/details/129268221