Super simple convolution and addition fusion, pseudocode version

Please do not reprint this original article without permission. If necessary, contact the author.

A few days ago I wrote an article about fusing convolution and addition in a convolutional neural network (CNN). A reader asked whether I could write a version with code, to make it easier to understand.

My first reaction was that a real code version involves far too many details. After thinking it over, though, I realized that what the reader wanted was not the details but the overall flow.

So I will write a simple pseudocode version that sketches the general idea of the code.

As for how to implement the convolution algorithm itself, I recommend asking ChatGPT or reading an open-source deep learning repository.

If you haven't read the previous article, you may want to take a look at it first: Super simple convolution and addition fusion.

As before, let's take a fragment of the ResNet50 graph as the example and fuse a convolution with an addition.

Under normal circumstances, this network fragment executes roughly like this:

BatchNorm -> Relu -> Conv -| (left branch of Add)
                           |-> Add
                  -> Conv -| (right branch of Add)

In pseudocode this is just a sequence of calls, for example:

bn_out = BatchNorm(x)          # x is the input to this fragment
relu_out = Relu(bn_out)
conv_out_left = Conv2d(relu_out)
conv_out_right = Conv2d(...)   # right-branch Conv; its input is elided here
add_out = Add(conv_out_left, conv_out_right)
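
For readers who want something runnable rather than pseudocode, here is a minimal PyTorch sketch of the same unfused fragment. The channel count, spatial size, and 1x1 kernels are made-up placeholders (not the real ResNet50 values), PyTorch uses NCHW layout rather than the article's [n, h, w, c], and I feed the right-branch Conv the same ReLU output only to fill in the "..." from the pseudocode.

import torch
import torch.nn as nn

# Hypothetical sizes, chosen only to make the fragment run.
bn = nn.BatchNorm2d(64)
relu = nn.ReLU()
conv_left = nn.Conv2d(64, 64, kernel_size=1)    # the Conv on the BN -> ReLU -> Conv path
conv_right = nn.Conv2d(64, 64, kernel_size=1)   # the Conv on the other branch

x = torch.randn(1, 64, 56, 56)          # NCHW in PyTorch; the article writes shapes as [n, h, w, c]

bn_out = bn(x)
relu_out = relu(bn_out)
conv_out_left = conv_left(relu_out)
conv_out_right = conv_right(relu_out)   # the article elides this branch's input with "..."
add_out = conv_out_left + conv_out_right
print(add_out.shape)                    # torch.Size([1, 64, 56, 56])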

Once the fusion is done, the Conv and Add highlighted in the figure (the red box) become a single operator; let's call the fused operator ConvAdd.

The graph then turns into one containing the fused ConvAdd node, and the calling logic of the whole fragment becomes:

bn_out = BatchNorm(x)
relu_out = Relu(bn_out)
conv_out_right = Conv2d(...)
add_out = ConvAdd(relu_out, conv_out_right)
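
For intuition only: fusion like this is typically implemented as a graph-rewrite pass that finds a Conv2d node feeding an Add node and replaces the pair with a single ConvAdd node. Here is a minimal sketch over a toy graph representation (a list of dicts I made up for illustration, not the IR of any real framework):

def fuse_conv_add(nodes):
    """Replace each (Conv2d -> Add) pair with one ConvAdd node."""
    by_name = {n["name"]: n for n in nodes}
    out, folded = [], set()
    for node in nodes:
        if node["op"] == "Add":
            # Pick an Add input that is produced by a Conv2d node in this graph.
            conv_name = next((i for i in node["inputs"]
                              if i in by_name and by_name[i]["op"] == "Conv2d"), None)
            if conv_name is not None:
                conv = by_name[conv_name]
                other = [i for i in node["inputs"] if i != conv_name]
                # ConvAdd takes the Conv's own inputs plus the other Add operand.
                out.append({"name": node["name"], "op": "ConvAdd",
                            "inputs": conv["inputs"] + other})
                folded.add(conv_name)
                continue
        out.append(node)
    # Drop the Conv nodes that were folded into ConvAdd.
    return [n for n in out if n["name"] not in folded]

graph = [
    {"name": "bn",    "op": "BatchNorm", "inputs": ["x"]},
    {"name": "relu",  "op": "Relu",      "inputs": ["bn"]},
    {"name": "convL", "op": "Conv2d",    "inputs": ["relu"]},
    {"name": "convR", "op": "Conv2d",    "inputs": ["relu"]},
    {"name": "add",   "op": "Add",       "inputs": ["convL", "convR"]},
]
for n in fuse_conv_add(graph):
    print(n)

Running this prints a graph in which convL and add have been replaced by one ConvAdd node whose inputs are relu and convR, matching the fused calling logic above.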

Once ConvAdd is treated as a single operator, a lot becomes possible inside it: tiling the image, pipelining the computation, and running the pieces in parallel.

Assume the network now runs on an ASIC chip whose convolution unit and addition unit are independent hardware modules with no shared dependencies, so they can execute concurrently.

Assume the feature map fed into the convolution has shape [n, hi, wi, ci] and the convolution kernel has shape [co, kh, kw, ci].

The remaining parameters are simplified: padding is 0, stride is 1, and dilation is 1.

The output of the convolution is [n, ho, wo, co].

The addition that follows then adds two tensors of this shape: [n, ho, wo, co] + [n, ho, wo, co] = [n, ho, wo, co].
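
As a quick sanity check on these shapes, here is the standard convolution output-size formula wrapped in a tiny helper (my own, with made-up example numbers):

def conv_out_size(i, k, pad=0, stride=1, dilation=1):
    # Output size along one spatial dimension for a standard convolution.
    return (i + 2 * pad - dilation * (k - 1) - 1) // stride + 1

n, hi, wi, ci = 1, 56, 56, 64   # example input shape [n, hi, wi, ci]
co, kh, kw = 64, 3, 3           # example kernel shape [co, kh, kw, ci]

ho = conv_out_size(hi, kh)      # pad=0, stride=1, dilation=1, as assumed above
wo = conv_out_size(wi, kw)
print([n, ho, wo, co])          # convolution output shape; the Add leaves it unchanged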

Now we split the convolution's input (think of it as an image) into two parts along the H dimension.

To compute the whole image, the convolution now has to be called twice: the first call handles the upper half and the second call handles the lower half.

Most pixels in the two calls are unrelated to each other; dependencies can only arise at the junction of the two halves (this happens when the kernel size is greater than 1 or the stride is greater than 1). We ignore those cases for now and treat the two halves as completely independent; see the sketch below for what the boundary dependency looks like.
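
To make that boundary condition concrete, the helper below (my own illustration, not from the article) computes which input rows each output chunk reads: with kh = 1 and stride = 1 the two halves read disjoint rows, which is the case assumed here, while with kh = 3 they overlap by kh - 1 rows at the junction.

def input_rows_for_split(hi, kh, num_parts=2, stride=1, pad=0):
    # For each chunk of output rows, return the [start, end) range of input rows it reads.
    ho = (hi + 2 * pad - kh) // stride + 1
    chunk = ho // num_parts
    ranges = []
    for p in range(num_parts):
        out_start = p * chunk
        out_end = ho if p == num_parts - 1 else (p + 1) * chunk
        in_start = max(0, out_start * stride - pad)
        in_end = min(hi, (out_end - 1) * stride - pad + kh)
        ranges.append((in_start, in_end))
    return ranges

print(input_rows_for_split(hi=8, kh=1))   # [(0, 4), (4, 8)]  -> disjoint halves
print(input_rows_for_split(hi=8, kh=3))   # [(0, 5), (3, 8)]  -> 2 overlapping rows at the junction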

The first convolution call takes input [n, hi/2, wi, ci] and produces output [n, ho/2, wo, co]; it covers the first half of hi (the part drawn in black in the original figure).

The second convolution call likewise takes input [n, hi/2, wi, ci] and produces output [n, ho/2, wo, co]; it covers the second half of hi (the part drawn in italics in the original figure).

Similarly, the addition is split into two calls, corresponding to the two convolution outputs:

The first addition consumes the output of the first convolution, i.e. [n, ho/2, wo, co].

The second addition consumes the output of the second convolution, i.e. [n, ho/2, wo, co].

With two tiles, the internal logic of the ConvAdd operator looks roughly like this:

conv_out_part1 = Conv2d(part1)
conv_out_part2 = Conv2d(part2)
add_out_part1 = Add(conv_out_part1)   # the matching slice of the right branch is the implicit second operand
add_out_part2 = Add(conv_out_part2)

But this is obviously not good enough, because everything is still serial: the first convolution finishes, then the second convolution runs, then the first addition runs, and so on. So how do we make Conv and Add run in parallel?

Observe that the first Add does not depend on the second Conv, and we have already assumed that the Conv unit and the Add unit on the ASIC are completely independent.

So the way to overlap the second Conv with the first Add is: as soon as the first Conv finishes, start the first Add and launch the second Conv at the same time. The code then looks roughly like this:

conv1 = Conv2d(part1)
-----------------------
add1 = Add(conv1)      
conv2 = Conv2d(part2) 
-----------------------
add2 = Add(conv2)

In the middle pipeline stage, Add and Conv now run in parallel.

By "one pipeline stage" I mean the code between two "-----" lines in the snippet above; everything within it is issued in the same stage.

If the image is split into more parts, more pipeline stages can overlap Conv with Add.

For example, if it is split into 4 parts, Conv and Add overlap in 3 pipeline stages:

conv1 = Conv2d(part1)
-----------------------
add1 = Add(conv1)      
conv2 = Conv2d(part2) 
-----------------------
add2 = Add(conv2)
conv3 = Conv2d(part3) 
-----------------------
add3 = Add(conv3)      
conv4 = Conv2d(part4) 
-----------------------
add4 = Add(conv4)      
-----------------------

Note that each "-----" in the pseudocode above is really a synchronization point. When the code is deployed on real hardware, a synchronization operation must be placed at each of these points to ensure that all computation in the previous pipeline stage has completed.

Common synchronization mechanisms include dedicated sync instructions or barrier instructions. Assuming we synchronize with a barrier instruction, the complete pseudocode becomes:

conv1 = Conv2d(part1)
barrier()

add1 = Add(conv1)      
conv2 = Conv2d(part2) 
barrier()

add2 = Add(conv2)
conv3 = Conv2d(part3) 
barrier()

add3 = Add(conv3)      
conv4 = Conv2d(part4) 
barrier()

add4 = Add(conv4)      
barrier()
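
To see how the barriers create the overlap, here is a small sketch that emulates the two independent hardware units with two Python threads: one thread plays the Conv unit, the other plays the Add unit, and threading.Barrier stands in for the chip's barrier instruction. The Conv2d/Add bodies are just string placeholders, not real computation.

import threading

NUM_PARTS = 4
stage_barrier = threading.Barrier(2)          # both "units" meet here at every stage boundary

parts = [f"part{i}" for i in range(1, NUM_PARTS + 1)]
conv_out, add_out = {}, {}

def conv_unit():
    # Computes conv1..conv4, one tile per pipeline stage.
    for i, part in enumerate(parts, start=1):
        conv_out[i] = f"Conv2d({part})"       # placeholder for the real convolution
        stage_barrier.wait()                  # end of this pipeline stage
    stage_barrier.wait()                      # last stage has no Conv work, only Add

def add_unit():
    # Runs one stage behind: in stage k it adds the tile convolved in stage k-1.
    stage_barrier.wait()                      # stage 1: nothing to add yet
    for i in range(1, NUM_PARTS + 1):
        add_out[i] = f"Add({conv_out[i]})"    # placeholder for the real addition
        stage_barrier.wait()

threads = [threading.Thread(target=conv_unit), threading.Thread(target=add_unit)]
for t in threads: t.start()
for t in threads: t.join()
print(add_out)   # {1: 'Add(Conv2d(part1))', ..., 4: 'Add(Conv2d(part4))'}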

Of course, the code above is rather long; it can be rewritten as a loop. Again taking the 4-way split along H as the example:

conv = {}
add = {}

conv[1] = Conv2d(part[1])
barrier()

for i in range(1, 4):
    add[i] = Add(conv[i])
    conv[i + 1] = Conv2d(part[i + 1])
    barrier()

add[4] = Add(conv[4])
barrier()

The logic of the pseudocode is still very simple; the key is to understand the idea of pipelining Conv and Add so they run in parallel.

This method works in many fusion scenarios; it is not limited to the Conv and Add operators, nor to any particular neural network.

As long as the hardware has computation units that can run in parallel, and the network graph contains adjacent layers with a producer-consumer dependency, they can almost always be fused this way to improve overall performance.
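
In that spirit, the schedule itself can be written once as a generic two-stage software pipeline that knows nothing about Conv or Add; the helper below and its names are my own sketch of the pattern:

def pipeline_two_stages(parts, stage1, stage2, barrier=lambda: None):
    # Computes stage2(stage1(part)) for every part, but issues stage1 of tile i+1
    # in the same pipeline stage as stage2 of tile i, mirroring the ConvAdd schedule.
    results = []
    prev = stage1(parts[0])
    barrier()
    for nxt in parts[1:]:
        results.append(stage2(prev))   # finish tile i on one unit
        prev = stage1(nxt)             # start tile i+1 on the other unit
        barrier()
    results.append(stage2(prev))
    barrier()
    return results

# The same 4-tile ConvAdd schedule, with placeholder stage functions.
out = pipeline_two_stages(
    ["part1", "part2", "part3", "part4"],
    stage1=lambda p: f"Conv2d({p})",
    stage2=lambda c: f"Add({c})",
)
print(out)   # ['Add(Conv2d(part1))', ..., 'Add(Conv2d(part4))']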



Origin blog.csdn.net/dongtuoc/article/details/129395605