A clear, humorous, and relaxed in-depth look at the YOLOv5 network structure and its details (distilled from countless papers and write-ups)

A recent post of mine about detecting small targets with YOLOv5 got a surprisingly high number of hits — YOLOv5 is clearly still very influential. So today, in the same humorous, clear, and relaxed style, I will walk you through the network structure of YOLOv5 that fascinates so many people, and how it improves on other algorithms. As always, I hope my casual talk helps you; if you find it useful, bookmark it, and if you have questions, comment below — I will do my best to answer!

1. Introduction to YOLOv5

YOLOv5 is a single-stage object detection algorithm. Building on YOLOv4, it adds several new improvements that raise both its speed and its accuracy. The main ones are as follows:

Input end: several improvements in the training phase, mainly Mosaic data augmentation, adaptive anchor box calculation, and adaptive image scaling.
Backbone: ideas integrated from other detection algorithms, mainly the Focus structure and the CSP structure.
Neck: detection networks often insert extra layers between the backbone and the final head; YOLOv5 adds an FPN+PAN structure here.
Head: the anchor box mechanism is the same as in YOLOv4; the main changes are the GIOU_Loss loss function during training and DIOU_NMS for filtering predicted boxes.
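Since GIOU_Loss handles box regression during training, here is a minimal pure-Python sketch of GIoU for axis-aligned boxes in (x1, y1, x2, y2) format. The function name and box format are my own illustration, not YOLOv5's actual code:

```python
def giou(box1, box2):
    """Generalized IoU between two (x1, y1, x2, y2) boxes.

    GIoU = IoU - |C \\ (A U B)| / |C|, where C is the smallest box
    enclosing both A and B.  The training loss is GIoU_Loss = 1 - GIoU.
    """
    # Intersection rectangle (empty intersection clamps to zero area).
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    iou = inter / union

    # Smallest enclosing box C.
    cx1, cy1 = min(box1[0], box2[0]), min(box1[1], box2[1])
    cx2, cy2 = max(box1[2], box2[2]), max(box1[3], box2[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)

    # Unlike plain IoU, GIoU still gives a gradient signal (a negative
    # value) when the boxes do not overlap at all.
    return iou - (c_area - union) / c_area
```

For identical boxes GIoU is 1; for disjoint boxes it goes negative, which is exactly why it works better than IoU as a loss when predictions start far from the target.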

On GitHub, the maintainers have released YOLOv5 v6.0, whose main change is replacing the parallel SPP structure with a serial one (SPPF). Their test experiments show a significant improvement in both parameter count and FLOPs, among other metrics.
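To see why the serial rewrite computes the same thing as the parallel SPP, here is a 1-D toy sketch in plain Python (not the real layer): chaining stride-1, 5-wide max pools reproduces the 9-wide and 13-wide windows, so one small kernel applied repeatedly replaces three large parallel kernels.

```python
def maxpool1d(seq, k):
    """Stride-1 'same' max pooling over a 1-D list (windows clamp at edges)."""
    r = k // 2
    n = len(seq)
    return [max(seq[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]

seq = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]

# Parallel SPP: 5-, 9-, and 13-wide pools all applied to the same input.
p5, p9, p13 = maxpool1d(seq, 5), maxpool1d(seq, 9), maxpool1d(seq, 13)

# Serial SPPF: chained 5-wide pools.
s5 = maxpool1d(seq, 5)
s9 = maxpool1d(s5, 5)    # two 5-wide pools cover a 9-wide window
s13 = maxpool1d(s9, 5)   # three 5-wide pools cover a 13-wide window

assert p9 == s9 and p13 == s13  # identical outputs, cheaper computation
```

The outputs match exactly; the serial version is faster because each pooling pass only ever scans a 5-wide window.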

In my opinion, in technology — and especially in the ever-changing field of artificial intelligence — you naturally have to keep learning the latest techniques. (ps: I have been busy with my next thesis lately and often feel there are too many new techniques to learn; the field improves all the time, so study hard while you are young!)

2. The network structure of YOLOv5 (drawn by me — if you reprint it, please credit the source; die-hard fans excepted, haha)

3. Some basic details worth emphasizing (keep up with the pace of learning — it will be over soon)

The backbone feature extraction network used by YOLOv5 is CSPDarknet, which has five important features:
3.1. It uses residual (Residual) blocks. A residual convolution in CSPDarknet splits into two parts: the main path is a 1x1 convolution followed by a 3x3 convolution, while the residual (shortcut) edge does no processing and directly adds the block's input onto its output. The entire YOLOv5 backbone is built from such residual convolutions.

The appeal of residual networks is that they are easy to optimize and can gain accuracy from considerable added depth: the skip connections inside the residual blocks alleviate the vanishing-gradient problem that comes with making deep networks deeper.

def Bottleneck(x, out_channels, shortcut=True, name=""):
    # Main path: a 1x1 convolution followed by a 3x3 convolution.
    # (DarknetConv2D_BN_SiLU = Conv2D + BatchNormalization + SiLU, defined elsewhere.)
    y = compose(
            DarknetConv2D_BN_SiLU(out_channels, (1, 1), name=name + '.cv1'),
            DarknetConv2D_BN_SiLU(out_channels, (3, 3), name=name + '.cv2'))(x)
    if shortcut:
        # Residual edge: add the block input straight onto the main-path output.
        y = Add()([x, y])
    return y

3.2. It uses the CSPNet structure. The idea is not complicated: the stack of residual blocks is split into two parts. The main part keeps stacking the original residual blocks; the other part, much like a residual edge, receives only a small amount of processing and connects directly to the end. You can therefore think of CSP as adding one large residual edge around the whole stack.

def C3(x, num_filters, num_blocks, shortcut=True, expansion=0.5, name=""):
    hidden_channels = int(num_filters * expansion)  # hidden channels
    # Main part: 1x1 convolution, then a stack of residual Bottleneck blocks.
    x_1 = DarknetConv2D_BN_SiLU(hidden_channels, (1, 1), name=name + '.cv1')(x)
    # Large residual edge: a single 1x1 convolution, no residual blocks.
    x_2 = DarknetConv2D_BN_SiLU(hidden_channels, (1, 1), name=name + '.cv2')(x)
    for i in range(num_blocks):
        x_1 = Bottleneck(x_1, hidden_channels, shortcut=shortcut, name=name + '.m.' + str(i))
    # Merge the two parts and fuse them with a final 1x1 convolution.
    route = Concatenate()([x_1, x_2])
    return DarknetConv2D_BN_SiLU(num_filters, (1, 1), name=name + '.cv3')(route)


3.3. It uses the Focus structure, an interesting trick in YOLOv5. The operation samples every other pixel of the image at its four possible offsets, producing four independent feature maps, and then stacks them along the channel axis. Width and height information is thereby folded into the channels: the spatial size halves while the channel count quadruples, so a 3-channel input becomes a 12-channel feature map.

class Focus(Layer):
    def __init__(self):
        super(Focus, self).__init__()

    def compute_output_shape(self, input_shape):
        # Height and width halve (when statically known); channels quadruple.
        h = input_shape[1] // 2 if input_shape[1] is not None else None
        w = input_shape[2] // 2 if input_shape[2] is not None else None
        return (input_shape[0], h, w, input_shape[3] * 4)

    def call(self, x):
        # Take every other pixel at each of the four offsets and
        # stack the four sub-images along the channel axis.
        return tf.concat(
            [x[...,  ::2,  ::2, :],
             x[..., 1::2,  ::2, :],
             x[...,  ::2, 1::2, :],
             x[..., 1::2, 1::2, :]],
            axis=-1)


3.4. It uses the SiLU activation function, SiLU(x) = x · sigmoid(x), an improved combination of Sigmoid and ReLU. SiLU is unbounded above, bounded below, smooth, and non-monotonic, and it outperforms ReLU on deep models; it can be regarded as a smooth ReLU.

class SiLU(Layer):
    def __init__(self, **kwargs):
        super(SiLU, self).__init__(**kwargs)
        self.supports_masking = True

    def call(self, inputs):
        return inputs * K.sigmoid(inputs)

    def get_config(self):
        config = super(SiLU, self).get_config()
        return config

    def compute_output_shape(self, input_shape):
        return input_shape

3.5. It uses the SPP structure, which extracts features via max pooling with several different kernel sizes and concatenates the results, enlarging the network's receptive field. In YOLOv4, SPP sits in the FPN; in YOLOv5, the SPP module is placed in the backbone feature extraction network.

def SPPBottleneck(x, out_channels, name=""):
    # 1x1 convolution to halve the channels before pooling.
    x = DarknetConv2D_BN_SiLU(out_channels // 2, (1, 1), name=name + '.cv1')(x)
    # Stride-1 'same' max pooling at three kernel sizes enlarges the
    # receptive field without changing the spatial resolution.
    maxpool1 = MaxPooling2D(pool_size=(5, 5), strides=(1, 1), padding='same')(x)
    maxpool2 = MaxPooling2D(pool_size=(9, 9), strides=(1, 1), padding='same')(x)
    maxpool3 = MaxPooling2D(pool_size=(13, 13), strides=(1, 1), padding='same')(x)
    # Concatenate the original features with the three pooled versions,
    # then fuse with a final 1x1 convolution.
    x = Concatenate()([x, maxpool1, maxpool2, maxpool3])
    x = DarknetConv2D_BN_SiLU(out_channels, (1, 1), name=name + '.cv2')(x)
    return x

4. Applications and principles in the small-target field

I covered this before — see my earlier article, which introduces two common, simple ideas and their code.

(Linked article: yolov5 small target detection — improving the detection accuracy of small targets.)        First of all, let's settle one thing: what is a small target? The three soul-searching questions: what counts as small, where is it small, and why is it small. To put it bluntly, a so-called small target is small in physical size, and on top of that occupies too few pixels when captured; targets of roughly 20x20 to 40x40 pixels are generally considered small.
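As a trivial sketch of that rule of thumb — the helper name and the ~40x40 threshold are taken from the text above, not from any standard API (for reference, COCO's official definition instead uses box area below 32² pixels):

```python
def is_small_target(w_px, h_px, thr=40):
    """Treat boxes up to roughly thr x thr pixels as 'small' targets."""
    return w_px * h_px <= thr * thr

# A 20x20 face in a surveillance frame counts as small; a 100x80 car does not.
```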

        So what can you do about it? Good question — let me introduce another method. YOLOv5 defines anchor boxes, and here comes the next question: what is an anchor box? An anchor box is a prior box from which the network regresses the frame that encloses the target object. There are two approaches. The first, which I described in the article linked above, is to hand-specify anchor sizes for your own dataset; this takes patience, adjusting the anchors bit by bit until they fit your data. Note that the three groups of anchors correspond, from top to bottom in the config, to small, medium, and large targets (the smaller anchors live on the higher-resolution feature map). As for why, I understand it roughly; if it is not clear to you, ask me in the comments below. The second is the adaptive anchor calculation (in YOLOv5's autoanchor code), which runs a genetic algorithm for 1000 iterations to find the anchors that best fit your dataset. Likewise, if the genetic algorithm is unfamiliar, comment below and ask me (spare this programmer, haha).
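The genetic-style anchor search can be sketched in a few lines of plain Python. This is a simplified stand-in, not YOLOv5's actual autoanchor code, and the ratio-based fitness is my own rough approximation of its metric: anchors are randomly mutated for a fixed number of generations, and a candidate is kept only if it fits the dataset's box shapes better.

```python
import random

def fitness(anchors, boxes):
    """Mean best shape-match between ground-truth (w, h) boxes and anchors.

    For each box, the match to an anchor is the product of the per-dimension
    ratios min(b/a, a/b); 1.0 means a perfect shape match.
    """
    total = 0.0
    for bw, bh in boxes:
        best = 0.0
        for aw, ah in anchors:
            r = min(bw / aw, aw / bw) * min(bh / ah, ah / bh)
            best = max(best, r)
        total += best
    return total / len(boxes)

def evolve(anchors, boxes, iters=1000, sigma=0.1, seed=0):
    """Mutate anchors with Gaussian noise; keep a candidate only if it improves."""
    rng = random.Random(seed)
    best, best_fit = anchors, fitness(anchors, boxes)
    for _ in range(iters):
        cand = [(max(1.0, w * (1 + rng.gauss(0, sigma))),
                 max(1.0, h * (1 + rng.gauss(0, sigma)))) for w, h in best]
        f = fitness(cand, boxes)
        if f > best_fit:
            best, best_fit = cand, f
    return best, best_fit

# Made-up ground-truth (w, h) sizes and a deliberately poor starting set.
boxes = [(10, 12), (30, 28), (60, 55)]
anchors, fit = evolve([(8, 8), (25, 25), (50, 50)], boxes, iters=300)
```

Because mutations are only accepted when fitness improves, the evolved anchors are never worse than the starting set — the same greedy principle the real 1000-iteration search relies on.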

Origin blog.csdn.net/m0_58508552/article/details/124814395