Basic knowledge and terminology

Two-stage: A two-stage algorithm focuses in the first stage on finding where objects may appear and producing region proposals (candidate boxes), ensuring a sufficiently high recall; the second stage then classifies these proposals and regresses more accurate positions. A typical example is Faster RCNN. Two-stage algorithms are usually more accurate but slower. There are also multi-stage algorithms, such as Cascade RCNN.

One-stage: A one-stage algorithm merges the two stages of a two-stage algorithm into one, predicting both the location and the category of objects in a single pass. These methods are usually simpler in structure and rely on techniques such as feature fusion and Focal Loss. They are generally faster than two-stage networks, but some accuracy is lost. Typical algorithms include SSD and the YOLO series.

Anchor: An epoch-making idea that first appeared in Faster RCNN. Anchors are essentially a set of prior boxes of different sizes and aspect ratios, distributed evenly over the feature map; the network uses the features to predict the category of each anchor and its offset from the real object box. Anchors provide a ladder for object detection, so the detector does not have to predict objects entirely from scratch, and accuracy is often higher as a result. Common anchor-based algorithms include Faster RCNN and SSD.
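As an illustration (not the exact Faster RCNN code), here is a minimal Python sketch of laying anchors out on a feature map; the stride, scales, and aspect ratios are hypothetical values:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Place one set of prior boxes (anchors) at every feature-map cell.

    Returns an array of shape (feat_h * feat_w * len(scales) * len(ratios), 4)
    holding (x1, y1, x2, y2) boxes in input-image coordinates.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell centre on the input image
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)    # same area, different aspect ratio
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

# e.g. a 38x50 feature map (stride 16 on a 600x800 image) -> 17100 anchors
print(generate_anchors(38, 50).shape)
```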

IoU: Intersection over Union, the overlap between a generated candidate box and the ground-truth box, i.e. the ratio of the area of their intersection to the area of their union. We usually choose a threshold for IoU, such as 0.5, to decide whether a predicted box is correct: when the IoU of the two boxes is greater than 0.5 we count it as a valid detection, otherwise it is an invalid match.
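A minimal Python sketch of computing IoU for two boxes given as (x1, y1, x2, y2) corners:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # ~0.143, below a 0.5 threshold
```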

Recall (R): the proportion of ground-truth boxes that have been detected so far out of all ground-truth boxes.

Precision (P): the proportion of correct predicted boxes among all predicted boxes considered so far.
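A tiny sketch of how precision and recall are computed from the counts of true positives, predictions, and ground-truth boxes:

```python
def precision_recall(num_tp, num_pred, num_gt):
    """Precision = TP / all predictions, Recall = TP / all ground-truth boxes."""
    precision = num_tp / num_pred if num_pred else 0.0
    recall = num_tp / num_gt if num_gt else 0.0
    return precision, recall

# e.g. 8 correct boxes out of 10 predictions, with 12 objects present
print(precision_recall(8, 10, 12))  # (0.8, 0.666...)
```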

mAP: mean Average Precision. This metric is usually used to evaluate the quality of a detection model. AP here is the detection precision of a single category, and mAP is the average of the APs over all categories.

AP: the area under the PR curve. It takes the precision at different recall levels into account and does not favour either P or R.

AP = \int_{0}^{1} P \, dR
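As a rough sketch, AP can be approximated numerically from a sampled PR curve (here with VOC-style monotone interpolation; the sample points below are made up):

```python
import numpy as np

def average_precision(recalls, precisions):
    """Approximate AP = integral of P dR as the area under the interpolated PR curve."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # make precision monotonically decreasing (interpolation step)
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # sum P * dR over the points where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

print(average_precision(np.array([0.2, 0.4, 0.8]), np.array([1.0, 0.8, 0.6])))  # 0.6
```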

Shrinking an image (also called subsampling or downsampling) has two main purposes: 1. make the image fit the size of the display area; 2. generate a thumbnail of the image.

Enlarging an image (also called upsampling or interpolating) mainly aims to enlarge the original image so that it can be displayed on a higher-resolution device.
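For example, in PyTorch (assuming it is available) both operations can be done with bilinear interpolation via F.interpolate:

```python
import torch
import torch.nn.functional as F

img = torch.rand(1, 3, 256, 256)  # (batch, channels, H, W)
small = F.interpolate(img, scale_factor=0.5, mode="bilinear", align_corners=False)  # downsample / thumbnail
big = F.interpolate(img, size=(512, 512), mode="bilinear", align_corners=False)     # upsample / enlarge
print(small.shape, big.shape)  # (1, 3, 128, 128) and (1, 3, 512, 512)
```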

Receptive field: the region of the input image that a pixel on the feature map output by a layer of the convolutional neural network maps back to. Put simply, it is how large an area of the original image a single point on the feature map can "see".
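A small sketch of how the receptive field of a plain stack of convolution/pooling layers can be computed (ignoring padding and dilation):

```python
def receptive_field(layers):
    """Receptive field of a plain conv/pool stack; each layer is (kernel, stride).

    RF_l = RF_{l-1} + (k - 1) * product of earlier strides
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# e.g. two 3x3/stride-1 convs followed by a 2x2/stride-2 pooling
print(receptive_field([(3, 1), (3, 1), (2, 2)]))  # 6
```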

region proposal: candidate region

RoI: short for Region of Interest, the "region of interest" of an image; it is used in the RCNN family of algorithms.

RoI Pooling: the implementation process is shown in the figure. To output a fixed-size (e.g. 7x7) feature map, two quantization (rounding) operations are required: the first when the RoI coordinates are mapped onto the feature map that the network has downscaled, and the second when the mapped RoI region is divided into equally sized bins.

(Figure: the implementation process of RoI Pooling)

RoI Align: an improved version of RoI Pooling. RoI Align removes the quantization operations: it uses virtual pixels (bilinear interpolation) to obtain values at floating-point coordinates, turning the whole feature aggregation process into a continuous operation.
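For reference, torchvision ships both operators; a short sketch comparing them (the feature-map size, stride, and RoI coordinates are made-up values):

```python
import torch
from torchvision.ops import roi_pool, roi_align

feat = torch.rand(1, 256, 50, 50)                       # feature map, stride 16 vs. an 800x800 image
rois = torch.tensor([[0, 120.3, 80.7, 400.9, 320.2]])   # (batch_idx, x1, y1, x2, y2) in image coordinates

pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1 / 16)    # with rounding steps
aligned = roi_align(feat, rois, output_size=(7, 7), spatial_scale=1 / 16)  # bilinear sampling, no rounding
print(pooled.shape, aligned.shape)  # both (1, 256, 7, 7)
```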

 

layers in a neural network

Convolution layer: the essence of convolution is to use the parameters of the convolution kernel to extract features from the data; the result is obtained by element-wise multiplication of the kernel with each local patch, followed by summation.
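A minimal single-channel sketch of this sliding-window multiply-and-sum (no padding, for illustration only):

```python
import numpy as np

def conv2d_single(x, kernel, stride=1):
    """Valid convolution of one channel: slide the kernel, multiply element-wise, sum."""
    kh, kw = kernel.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # element-wise product, then sum
    return out

print(conv2d_single(np.arange(25.0).reshape(5, 5), np.ones((3, 3))).shape)  # (3, 3)
```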

Activation function layer: if a neural network were only a stack of linear convolution operations, it could not form a complex expression space and would struggle to extract high-level semantic information. A nonlinear mapping, the activation function, is therefore added so that the network can approximate arbitrary nonlinear functions, improving the expressive power of the whole network. In object detection tasks, commonly used activation functions are Sigmoid, ReLU, and Softmax.
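A few lines of NumPy showing what these three functions do to raw scores:

```python
import numpy as np

x = np.array([-2.0, 0.0, 3.0])
sigmoid = 1 / (1 + np.exp(-x))         # squashes each score into (0, 1)
relu = np.maximum(0, x)                # keeps positives, zeroes negatives
softmax = np.exp(x) / np.exp(x).sum()  # normalises the scores into a class distribution
print(sigmoid, relu, softmax)
```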

Pooling layer: in a convolutional network, a pooling layer is usually inserted between convolutional layers to reduce the size of the feature maps, speed up computation, and enlarge the receptive field; it is a downsampling operation. Pooling is a strong prior that makes the model pay more attention to whether a feature is present rather than exactly where it occurs. This dimensionality reduction retains the important feature information, improves robustness, and to some extent helps prevent overfitting.
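For example, a 2x2 max pooling with stride 2 halves the spatial size of a feature map (PyTorch sketch):

```python
import torch
import torch.nn as nn

feat = torch.rand(1, 64, 56, 56)
pool = nn.MaxPool2d(kernel_size=2, stride=2)  # halves the spatial size, keeps channels
print(pool(feat).shape)                       # (1, 64, 28, 28)
```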

Dropout layer: in deep learning, when there are many parameters and few training samples, the model easily overfits. Overfitting is a common problem in deep learning and machine learning in general; it shows up as high prediction accuracy on the training set while accuracy drops sharply on the test set. In 2012, Hinton et al. proposed Dropout, which effectively alleviates overfitting and provides a certain regularization effect.
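A small PyTorch sketch showing that Dropout randomly zeroes activations during training and is a no-op at inference:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))  # roughly half the activations zeroed, survivors scaled by 1/(1-p)
drop.eval()
print(drop(x))  # at inference Dropout passes the input through unchanged
```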

BN layer: in pursuit of higher performance, convolutional networks are designed deeper and deeper, but the deeper networks become hard to train, converge, and tune. The reason is that small changes in shallow-layer parameters are amplified through many layers of linear transformations and activation functions, changing the input distribution of each layer; deeper layers must constantly adapt to these shifting distributions, which ultimately makes the model hard to train and converge. This phenomenon, where the distribution of internal node activations changes as the network's parameters change, is called Internal Covariate Shift (ICS). ICS tends to push the training process into the saturation zone and slows down convergence. The ReLU mentioned above alleviates gradient saturation from the activation-function side, while the BN layer, proposed in 2015, avoids falling into the saturation zone by changing the data distribution itself. Because of its superior performance, the BN layer has become a "standard component" of today's convolutional networks.
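A quick PyTorch sketch: BatchNorm2d re-centres and re-scales each channel of a deliberately shifted input:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=16)    # one mean/var and gamma/beta pair per channel
x = torch.randn(8, 16, 32, 32) * 5 + 3  # deliberately shifted and scaled input
y = bn(x)
print(y.mean().item(), y.std().item())  # roughly 0 and 1: the distribution is re-centred
```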

Fully connected layer: fully connected layers are usually attached after the feature maps output by the convolutional network. Their characteristic is that each node is connected to every node in the previous and next layers, and the input and output are flattened into one-dimensional vectors, so in terms of parameter count the fully connected layer has by far the most parameters. It further maps the features abstracted by the convolutions into a label space of a given dimension, in order to compute a loss or output predictions. Its most fatal problem is exactly this huge number of parameters; in many scenarios a Global Average Pooling (GAP) layer can be used to replace it.

Using GAP has three advantages: 1. it uses pooling to reduce dimensionality, greatly lowering the number of network parameters; 2. it merges feature extraction and classification into one step, which helps prevent overfitting to some extent; 3. because the fully connected layer is removed, inputs of any image scale become possible.
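A small PyTorch sketch of a GAP-based head in place of a large flattened fully connected layer (the channel and class counts are hypothetical):

```python
import torch
import torch.nn as nn

feat = torch.rand(2, 512, 7, 7)  # backbone output; any spatial size would work

gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),     # global average pooling: (N, 512, 1, 1)
    nn.Flatten(),
    nn.Linear(512, 20),          # 20-class logits; far fewer weights than flattening 512*7*7
)
print(gap_head(feat).shape)      # (2, 20)
```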

Formulas

Formula for the output matrix size after convolution

N = (W - F + 2P)/S + 1

where the input image size is W×W, the filter size is F×F, the stride is S, and the padding is P pixels.
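A one-line helper to evaluate the formula (using floor division, as frameworks typically do when the division is not exact):

```python
def conv_output_size(W, F, S, P):
    """N = (W - F + 2P) / S + 1 for a square input and kernel."""
    return (W - F + 2 * P) // S + 1

# e.g. a 224x224 input, 7x7 kernel, stride 2, padding 3 -> 112x112
print(conv_output_size(224, 7, 2, 3))  # 112
```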


Source: blog.csdn.net/a545454669/article/details/123141504