Neural Networks: Classic Models & Popular Models

[1] Related concepts and calculation of IoU in object detection

IoU (Intersection over Union) is an important concept in object detection. It is the area of the intersection of the GT bbox and the predicted bbox divided by the area of their union.


Below, each box is represented by the coordinates (top, left, bottom, right), i.e., its upper-left and lower-right corners. Given two rectangles, the IoU can then be computed as follows.

def compute_iou(rect1, rect2):
  # rect = (y0, x0, y1, x1) = (top, left, bottom, right)
  S_rect1 = (rect1[2] - rect1[0]) * (rect1[3] - rect1[1])
  S_rect2 = (rect2[2] - rect2[0]) * (rect2[3] - rect2[1])

  sum_area = S_rect1 + S_rect2
  left_line = max(rect1[1], rect2[1])
  right_line = min(rect1[3], rect2[3])
  top_line = max(rect1[0], rect2[0])
  bottom_line = min(rect1[2], rect2[2])

  # No overlap: the two boxes do not intersect
  if left_line >= right_line or top_line >= bottom_line:
    return 0.0
  else:
    intersect = (right_line - left_line) * (bottom_line - top_line)
    return intersect / (sum_area - intersect)
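
A quick sanity check with two made-up boxes:

box_a = (0, 0, 100, 100)    # (top, left, bottom, right)
box_b = (50, 50, 150, 150)
print(compute_iou(box_a, box_b))  # intersection 2500, union 17500 -> ~0.1429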

[2] Related concepts and calculation of NMS in object detection

In object detection, non-maximum suppression (NMS) is used to post-process the large number of generated candidate boxes, removing redundant ones and keeping the most representative results, which speeds up object detection.

As shown in the figure below, redundant candidate boxes are eliminated and the best bbox is kept:


Non-maximum suppression (NMS) process:

  1. First we need to set two values: a Score threshold and an IOU threshold.

  2. For each object class, traverse all candidate boxes of that class, filter out those whose score is below the Score threshold, and sort the remaining boxes by classification probability: A < B < C < D < E < F.

  3. First mark the maximum probability rectangular box F as the candidate box we want to retain.

  4. Starting from the highest-probability box F, check whether the IoU between each of A–E and F exceeds the IoU threshold. Suppose the overlap of B and D with F exceeds the threshold; then B and D are removed.

  5. From the remaining boxes A, C, and E, select E, which has the highest probability, and mark it as a box to retain. Then compute the overlap of E with A and C, and remove the boxes whose overlap exceeds the threshold.

  6. Repeat this process until no boxes remain, keeping all boxes that were marked for retention.

  7. After each category is processed, return to step 2 to process the next category of objects again.

import numpy as np

def py_cpu_nms(dets, thresh):
  # Each row of dets is (x1, y1, x2, y2, score): the two corner coordinates of a box plus its confidence score
  x1 = dets[:, 0]
  y1 = dets[:, 1]
  x2 = dets[:, 2]
  y2 = dets[:, 3]
  scores = dets[:, 4]

  # Area of each candidate box
  areas = (x2 - x1 + 1) * (y2 - y1 + 1)
  # Sort by score in descending order (indices are stored)
  order = scores.argsort()[::-1]

  keep = []
  while order.size > 0:
    i = order[0]
    keep.append(i)
    # Intersection of the current highest-score box with the remaining boxes (numpy broadcasting gives vectors)
    xx1 = np.maximum(x1[i], x1[order[1:]])
    yy1 = np.maximum(y1[i], y1[order[1:]])
    xx2 = np.minimum(x2[i], x2[order[1:]])
    yy2 = np.minimum(y2[i], y2[order[1:]])

    # Intersection area; when boxes do not overlap, w or h would be negative, so clamp to 0
    w = np.maximum(0.0, xx2 - xx1 + 1)
    h = np.maximum(0.0, yy2 - yy1 + 1)
    inter = w * h
    # IoU: overlap area / (area1 + area2 - overlap area)
    ovr = inter / (areas[i] + areas[order[1:]] - inter)

    # Indices of boxes whose IoU with the current box is below the threshold
    inds = np.where(ovr < thresh)[0]
    # Update order; the indices above are relative to order[1:], i.e. offset by 1, so add 1
    order = order[inds + 1]

  return keep

[3] What is the difference between one-stage and two-stage object detection?

Two-stage object detection algorithms: first generate region proposals (RP, pre-selected boxes that may contain the objects to be detected), then classify each proposal with a convolutional neural network. Accuracy is higher, but speed is slower.

Main logic: feature extraction -> generate RP -> classification/localization regression.

Common two-stage object detection algorithms include: the Faster R-CNN series, R-FCN, etc.

One-stage object detection algorithms: without region proposals, features are extracted directly by the network to predict the object class and location. They are faster, with slightly lower accuracy than two-stage algorithms.

Main logic: feature extraction -> classification/localization regression.

Common one-stage object detection algorithms include: the YOLO series, SSD, RetinaNet, etc.


[4] What methods can improve small object detection?

  1. Improve image resolution. Small objects may only contain a few pixels in the bounding box, so the feature richness of small objects can be increased by increasing the resolution of the image.

  2. Increase the input resolution of the model. This is a general method with better effect, but it will bring about the problem of slower model inference speed.

  3. Tile images.


  4. Data augmentation. Useful augmentations for small object detection include random cropping, random rotation, and mosaic augmentation.

  5. Automatically learned anchors.

  6. Category optimization.

[5] What are the characteristics of the ResNet model and the problems it solves?

Every time I answer this question, I add a personal slant: I like to explain it from the perspective of electrical automation rather than computer science, because it reminds me of my college years.

ResNet works like a differential amplifier. The structural design and underlying idea of ResNet abstract a differential amplifier into machine learning: it strengthens the correlation between gradients in a deep network and highlights small changes during gradient backpropagation.

The characteristic of the model is the designed residual structure, which is very sensitive to small changes in the model output.

Why does adding the residual module have an effect?

Assumption: without the residual module, suppose the output is $F_{1}(x) = 5.1$ and the expected output is $H_{1}(x) = 5$. To learn the mapping directly we need $F_{1}(x) = H_{1}(x) = 5$; the relative change is small, so it is hard to learn.

But if we design $H_{1}(x) = F_{1}(x) + 5 = 5.1$, i.e., split the mapping so that $F_{1}(x) = 0.1$, then the learning goal becomes driving $F_{1}(x)$ from $0.1$ to $0$, and learning a mapping whose output changes from 0.1 to 0 is relatively simple. In other words, the mapping after introducing the residual module is more sensitive to output changes.

Further understanding: if $F_{1}(x) = 5.1$ and we continue training so that the mapping gives $F_{1}(x) = 5$, the rate of change is $(5.1 - 5) / 5.1 \approx 0.02$. Without the residual module, the corresponding weight updates may range from 0.01 down to 0.0000001; this is still manageable when the network is shallow, but once the network becomes deep it may no longer work well.

With the residual module, the target instead becomes changing $F_{1}(x) = 0.1$ to $F_{1}(x) = 0$, a rate of change of 100%. Obviously this has a much larger effect on adjusting the parameter weights.
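
As an illustrative PyTorch-style sketch (the layer choices below are assumptions, not the exact configuration from the ResNet paper), a residual block computes $H(x) = F(x) + x$:

import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    # A minimal residual block sketch: the branch learns F(x); the output is F(x) + x
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                      # the shortcut carries x unchanged
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))   # this branch learns the residual F(x)
        return self.relu(out + identity)  # H(x) = F(x) + x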

[6] What are the structure and characteristics of the ResNeXt model?

The ResNeXt model is an optimization of the ResNet model; its main idea is to introduce the Inception-style multi-branch idea into ResNet. As shown in the figure below, the left side is the classic ResNet structure and the right side is the ResNeXt structure, which converts the single-path convolution into a multi-branch, multi-path convolution, i.e., grouped convolution.


The author further proposed three equivalent forms of the ResNeXt block, and in structure (c) the grouped-convolution idea emerges naturally.

Finally, let’s take a look at the comparison chart of the structural differences between ResNeXt50 and ResNet50:


ResNeXt paper: "Aggregated Residual Transformations for Deep Neural Networks"

[7] What are the structures and characteristics of the MobileNet series models?

MobileNet is a lightweight network designed mainly for mobile phones and other embedded devices. MobileNetV1 builds on a VGG-style network by using depthwise separable convolutions, which greatly reduces the number of model parameters while losing little accuracy.

Depthwise separable convolution is composed of a depthwise convolution and a pointwise convolution.
Depthwise convolution (DW) effectively reduces the number of parameters and improves speed. However, since each feature map is convolved by only one kernel, each feature map output by DW contains only part of the input's information, and information cannot be exchanged across channels, resulting in "poor information flow". Pointwise convolution (PW) performs the cross-channel exchange of feature information and solves the "poor information flow" problem caused by DW convolution.


Comparison of the calculation amount of Depthwise Separable convolution and standard convolution:

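Following the MobileNet paper's notation (kernel size $D_K$, number of input channels $M$, number of output channels $N$, feature map spatial size $D_F$), the ratio of the two costs is:

$\dfrac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \dfrac{1}{N} + \dfrac{1}{D_K^{2}}$

For a $3\times3$ kernel this is roughly an 8x to 9x reduction in computation.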

Compared with standard convolution, depthwise separable convolution greatly reduces the amount of computation, and the saving becomes more pronounced as the number of convolution channels increases.

In addition, MobileNetV1 replaces pooling with stride-2 convolutions: downsampling is completed directly during convolution, saving a separate pooling step after the convolution and improving computation speed.
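
A minimal PyTorch-style sketch of a depthwise separable block (the Conv-BN-ReLU ordering and hyperparameters below are illustrative assumptions):

import torch.nn as nn

def depthwise_separable(in_ch, out_ch, stride=1):
    # Depthwise (DW): one 3x3 kernel per input channel (groups=in_ch), no cross-channel mixing
    # Pointwise (PW): 1x1 convolution that mixes information across channels
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

Setting stride=2 in the DW layer is how downsampling replaces pooling, as described above.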

MobileNetV1 paper: "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications"

[8] What are the structures and characteristics of the MobileNet series models? (Part 2)

MobileNetV2 introduces Linear Bottleneck and Inverted Residuals based on MobileNetV1 .

MobileNetV2 uses Linear Bottleneck (linear transformation) instead of the original nonlinear activation function to capture the manifold of interest. Experiments have proven that using Linear Bottleneck can better retain useful feature information in small networks.

Inverted Residuals reverse the channel pattern of the classic ResNet residual block (expand first, then project back down, instead of reduce-then-expand). Because MobileNetV2 uses the Linear Bottleneck structure, the extracted feature dimensions are generally low, and using only low-dimensional feature maps does not work well. If every convolution layer extracted features from low-dimensional maps, there would be no way to capture enough overall information. To extract comprehensive feature information, the block therefore first expands to a high-dimensional feature map, striking a balance.
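
A minimal PyTorch-style sketch of an inverted residual block (the expansion factor and layer details below are illustrative; the final 1x1 projection has no activation, matching the Linear Bottleneck idea):

import torch.nn as nn

class InvertedResidual(nn.Module):
    # Sketch: 1x1 expand -> 3x3 depthwise -> 1x1 linear projection, with a shortcut when shapes match
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_shortcut = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),  # linear bottleneck: no activation here
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out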


MobileNetV2 paper: "MobileNetV2: Inverted Residuals and Linear Bottlenecks"

MobileNetV3 has two major innovations as a whole :

1. A complementary combination of search techniques: resource-constrained NAS performs module-level (block-wise) search, while NetAdapt performs a local search that fine-tunes each layer after the modules are determined.

2. Network structure improvement: further reduce the number of network layers and introduce the h-swish activation function.

The author found that the swish activation function can effectively improve the accuracy of the network. However, swish is too computationally intensive. The author proposed h-swish (hard version of swish) as follows:

$h\text{-}swish(x) = x \cdot \dfrac{\mathrm{ReLU6}(x + 3)}{6}$

This nonlinearity brings many advantages while maintaining accuracy. First, ReLU6 can be implemented in many software and hardware frameworks. Secondly, it avoids the loss of numerical accuracy during quantization and runs quickly.

Optimization of MobileNetV3 model structure:

MobileNetV3 paper: "Searching for MobileNetV3"

[9] What are the structures and characteristics of the ViT (Vision Transformer) model?

ViT model features:
1. ViT applies a standard Transformer directly to image classification; its model structure contains no CNN.
2. To meet the Transformer's input requirements, the input image is split into small patches, and the linear embeddings of these patches are fed into the network as a sequence. At the output, a Class Token is used to make the classification prediction.
3. Compared with a CNN, the Transformer has less built-in translation invariance and local perception, so with little data it may not match a CNN. However, after pre-training on a large-scale dataset and then transferring, it can achieve SOTA performance on specific tasks.

The overall model structure of ViT :


It can be specifically divided into the following parts:

  1. Image patch embedding (a minimal sketch follows this list)

  2. Multi-head attention structure

  3. Multi-layer perceptron (MLP) structure

  4. Operations such as DropPath, Class Token, and Positional Encoding
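
A minimal PyTorch-style patch-embedding sketch (the patch size, embedding dimension, and image size below are illustrative assumptions):

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    # Split an image into non-overlapping patches and linearly embed each patch.
    # A Conv2d with kernel_size = stride = patch_size is equivalent to the split + linear projection.
    def __init__(self, img_size=224, patch_size=16, in_ch=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, embed_dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, embed_dim)

# tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))  # shape (1, 196, 768)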

[10] What are the structures and characteristics of the EfficientNet series models?

The EfficientNet series is obtained by jointly searching (via grid search) over three dimensions: network depth, width, and input image resolution. From EfficientNet-B0 up to EfficientNet-L2, accuracy keeps increasing, and so do the parameter count and memory requirements.

The model's scale is determined mainly by scaling coefficients for width, depth, and resolution, and these three dimensions are not independent of one another: a higher input resolution requires a deeper network to obtain a large enough receptive field, and it likewise requires more channels to capture fine-grained features.
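
In the EfficientNet paper this is formalized as compound scaling, where a single coefficient $\phi$ scales all three dimensions together:

$\text{depth } d = \alpha^{\phi}, \quad \text{width } w = \beta^{\phi}, \quad \text{resolution } r = \gamma^{\phi}, \quad \text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \ \alpha, \beta, \gamma \ge 1$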

EfficientNet model search logic

The interior of the EfficientNet model is implemented through multiple MBConv convolution modules. The specific structure of each MBConv convolution module is shown in the figure below. It has been experimentally proven that Depthwise Separable convolution is still very effective in large models; Depthwise Separable convolution has better feature extraction and expression capabilities than standard convolution .

In addition, the paper uses the Drop Connect method in place of traditional Dropout to prevent overfitting. The difference between DropConnect and Dropout is that during training, DropConnect does not randomly discard the outputs of hidden-layer nodes; instead it randomly discards part of their inputs, i.e., individual connections (weights).

EfficientNet paper: "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks"

As an aside, the paper gives a glimpse of the authors' exhausting hyperparameter-tuning process...

[11] Which classic models are frequently asked about in interviews?

Questions about specific models come up often in interviews. This topic is hard to pin down because models are complex and diverse and the interviewer may ask about almost anything. In the diagram below I have listed models that are high-value in both academia and industry, for everyone's reference.

It is best to strengthen your resume with projects, competitions, research, and so on, and to steer model-related questions in the interview toward the familiar models you used in that work.

[12] What is the role of Focal Loss?

Focal Loss is a loss function that solves the imbalance of categories and differences in classification difficulty in classification problems, allowing the model to focus more on difficult samples during the training process.

Focal Loss starts from two-classification problems, and the same idea can be transferred to multi-classification problems.

We know that the standard loss for binary classification problems is cross entropy :

$CE(y, \hat{y}) = -y\log(\hat{y}) - (1 - y)\log(1 - \hat{y})$

For binary classification we almost always apply the sigmoid activation function $\hat{y} = \sigma(x)$, so the above formula can be rewritten as:

$CE = \begin{cases} -\log\big(\sigma(x)\big), & y = 1 \\ -\log\big(\sigma(-x)\big), & y = 0 \end{cases}$

Here we use the identity $1 - \sigma(x) = \sigma(-x)$.

The cross-entropy formula given in the Focal Loss paper is as follows:

$CE(p, y) = \begin{cases} -\log(p), & y = 1 \\ -\log(1 - p), & \text{otherwise} \end{cases}$

where $y \in \{1, -1\}$ is the ground-truth label and $p \in [0, 1]$ is the predicted probability for the class $y = 1$.

We then define $p_{t}$:

$p_{t} = \begin{cases} p, & y = 1 \\ 1 - p, & \text{otherwise} \end{cases}$

Then the above cross-entropy formula can be rewritten compactly as $CE(p_{t}) = -\log(p_{t})$.

With this foundation, the Focal Loss paper first introduces the balanced cross-entropy function:

$CE(p_{t}) = -\alpha_{t}\log(p_{t})$

To address class imbalance, a weighting factor is added to the loss: for samples of the minority class, $\alpha_{t}$ is simply increased. But there is a problem with this: it only balances positive and negative samples and does not distinguish easy samples from hard ones.

Why does the above formula only solve the problem of imbalance between positive and negative samples?

Because the added coefficient $\alpha_{t}$ is defined analogously to $p_{t}$: when $label = 1$, $\alpha_{t} = \alpha$; when $label = -1$, $\alpha_{t} = 1 - \alpha$, and the range of $\alpha$ is $[0, 1]$. We can therefore set the value of $\alpha$ to control the contribution of positive and negative samples to the overall loss (for example, if class $1$ has far fewer samples than class $-1$, $\alpha$ can be set between $0.5$ and $1$ to increase the weight of class-$1$ samples).

Focal Loss

To distinguish hard samples from easy ones, the prototype of Focal Loss appears:

$FL(p_{t}) = -(1 - p_{t})^{\gamma}\log(p_{t})$

$(1 - p_{t})^{\gamma}$ is used to balance the uneven proportion of hard and easy samples, and $\gamma > 0$ amplifies the effect of $(1 - p_{t})$: it reduces the loss of easily classified samples so that the model focuses more on samples that are hard to classify or easily misclassified. For example, with $\gamma = 2$: if the model's predicted confidence $p_{t}$ is $0.9$, then $(1 - 0.9)^{\gamma} = 0.01$ and the FL value becomes very small; if the predicted confidence $p_{t}$ is $0.3$, then $(1 - 0.3)^{\gamma} = 0.49$ and its contribution to the loss becomes larger. When $\gamma = 0$, FL reduces to the cross-entropy loss.

To also handle the imbalance between positive and negative samples, the $\alpha_{t}$ factor from balanced cross entropy is added to the formula above, which balances the uneven proportion of positive and negative samples and gives the final Focal Loss:

$FL(p_{t}) = -\alpha_{t}(1 - p_{t})^{\gamma}\log(p_{t})$

The best experimental values given in the Focal Loss paper are $\alpha_{t} = 0.25$ and $\gamma = 2$.
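
A minimal PyTorch-style sketch of the binary Focal Loss above (taking raw logits and 0/1 targets is an assumption of this sketch, not prescribed by the paper):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # logits: raw model outputs; targets: labels in {0, 1} with the same shape as logits
    targets = targets.float()
    p = torch.sigmoid(logits)                               # predicted probability of the positive class
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)             # p_t as defined above
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()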

[14] What are the classic lightweight face detection models?

Face detection is a sub-task of general object detection. Compared with general object detection, which routinely handles on the order of a thousand categories, face detection focuses on a single class: faces. Using a general detection model here is extravagant, a bit like using a sledgehammer to crack a nut, and the large parameter redundancy hurts practicality on the deployment side. For this reason, academia has proposed many lightweight face detection models; Rocky introduces some representative ones here:

  1. libfacedetection
  2. Ultra-Light-Fast-Generic-Face-Detector-1MB
  3. A-Light-and-Fast-Face-Detector-for-Edge-Devices
  4. CenterFace
  5. DBFace
  6. RetinaFace
  7. MTCNN

[15] What are the structures and characteristics of the LFFD face detection model?

Rocky was asked about the LFFD model many times during internship and campus-recruitment interviews, with interviewers probing for LFFD-related algorithm solutions, which shows that the LFFD model still has real value in industry. Let's go through the key knowledge of the LFFD model:

LFFD (A Light and Fast Face Detector for Edge Devices) is suited to single-class detection tasks such as faces, pedestrians, and vehicles, and it is fast, small, and effective. LFFD is an anchor-free method: it uses receptive fields instead of anchors and taps 8 feature maps off the backbone to detect faces from small to large. The detection head consists of two parts: binary classification and bounding-box regression.

LFFD model structure

We can see that the LFFD model mainly consists of four parts: tiny part, small part, medium part, and large part.

BN layers are not used in the model because they slow inference by about 17%. The network mainly downsamples as early and as fast as possible while maintaining 100% coverage of face scales.

Main features of LFFD:

  1. The structure is simple and direct, and it is easy to deploy in mainstream AI end-side devices.

  2. Outstanding small-object detection: even in extremely high-resolution images (8K or larger), it can still detect objects of around 10 pixels.

LFFD loss function

The LFFD loss function is the weighted sum of regression loss and classification loss.

The classification loss uses cross-entropy loss.

The regression loss uses the L2 loss function.

LFFD paper: "LFFD: A Light and Fast Face Detector for Edge Devices"

[16] What are the structure and characteristics of the U-Net model?

The U-Net network structure is as follows:

U-Net network structure

Features of U-Net network:

  1. Fully convolutional network: $1\times1$ convolutions completely replace the fully connected layers, so the model accepts inputs of any size.
  2. The left half of the network is the contracting path: convolution and max-pooling layers downsample the feature maps.
  3. The right half of the network is the expanding path: transposed convolutions upsample the feature maps, which are then concatenated (concat) with the feature maps of the corresponding layer of the contracting path (cropped so the two maps have the same size). Upsampling restores feature information, and the concat with the contracting-path features amounts to a fusion of, and a trade-off between, high-resolution and high-level semantic features (see the sketch after this list).
  4. U-Net proposed a clean overall encoder-decoder structure, which keeps U-Net vital and highly adaptable.
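
A minimal PyTorch-style sketch of one expansion-path step (upsample, crop, concat, convolve); the channel sizes are illustrative:

import torch
import torch.nn as nn

class UpBlock(nn.Module):
    # One U-Net decoder step: transposed-conv upsampling, concat with the encoder skip feature, then conv
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch * 2, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        # Center-crop the encoder feature so both maps have the same spatial size before concat
        dh, dw = skip.shape[2] - x.shape[2], skip.shape[3] - x.shape[3]
        skip = skip[:, :, dh // 2: dh // 2 + x.shape[2], dw // 2: dw // 2 + x.shape[3]]
        x = torch.cat([skip, x], dim=1)
        return self.conv(x)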

U-Net has very rich applications in medical images, defect detection and traffic scenes. It can be said that in actual image segmentation scenarios, U-Net is a universal Baseline.

U-Net's paper address: U-Net

[17] What is the structure and characteristics of the RepVGG model?

The basic architecture of the RepVGG model consists of more than 20 layers of $3\times3$ convolutions, divided into 5 stages. The first layer of each stage downsamples with stride = 2, and every convolution layer uses ReLU as the activation function.

Key features of RepVGG:

  1. The computational density of $3\times3$ convolutions on GPUs (theoretical operations divided by time used) can be up to four times that of $1\times1$ and $5\times5$ convolutions.
  2. The plain single-path structure computes more efficiently than multi-branch structures.
  3. The plain single-path structure uses less memory than multi-branch structures.
  4. The single-path architecture is more flexible and makes further operations such as model compression easier.
  5. RepVGG's body contains only one type of operator, which makes it convenient for chip manufacturers to design specialized chips and improve end-side AI efficiency.

So what enables RepVGG to achieve the SOTA effect in the above situation?

The answer is structural re-parameterization .

Structural reparameterization logic

In the training phase, a multi-branch model is trained and then equivalently converted into a single-path model; in the deployment phase, the single-path model is deployed. This way you get both the advantage of multi-branch training (higher performance) and the advantages of single-path inference (fast speed and memory savings).

More detailed knowledge of structural re-parameterization will be introduced in subsequent chapters, so stay tuned!

[18] What is the core idea of GAN?

In 2014, Ian Goodfellow first proposed the concept of GAN. Yann LeCun once said: "Generative adversarial networks and their variants are one of the most important ideas in machine learning in the past 10 years." The proposal of GAN brought generative models back onto the bright stage of the deep learning wave, standing on equal footing with discriminative models.

GAN consists of a generator $G$ and a discriminator $D$. The generator is responsible for producing sample data; its input is generally noise $Z$ randomly sampled from a Gaussian distribution. The discriminator's job is to distinguish samples produced by the generator from ground-truth ($gt$) samples; its input is generally $gt$ samples and the corresponding generated samples. We want the confidence output for $gt$ samples to be as close to $1$ as possible, and the confidence output for generated samples to be as close to $0$ as possible. Unlike ordinary neural networks, a GAN must train the generator and discriminator simultaneously, so its training is relatively difficult.

In the first GAN paper, the generator is likened to a criminal printing counterfeit money and the discriminator to a policeman. The criminal works hard to make the counterfeit money look real, while the policeman keeps improving at spotting fakes. The two compete with each other and, as time goes on, both become stronger. The same holds in image generation: the generator keeps producing fake images that are as realistic as possible, and the discriminator judges whether an image is a $gt$ image or a generated one. The two keep optimizing against each other until the images produced by the generator are completely indistinguishable from real ones.

The adversarial idea of ​​GAN is mainly realized by its objective function . The specific formula is as follows:

$\min_{G}\max_{D} V(D, G) = \mathbb{E}_{x\sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_{z}(z)}[\log(1 - D(G(z)))]$

The above formula looks complicated, but it is not. Setting the details aside, the core of the entire formula is a min-max game, and it is here, where the mathematical reach of deep learning expands, that GAN begins to shine.

Now for the details. We can look at this formula from two angles: the discriminator's maximization and the generator's minimization. From the discriminator's perspective, we want to maximize the objective: the first term is the confidence the discriminator outputs for a $gt$ sample $x \sim P_{data}$, which should be as close to $1$ as possible; the second term involves the generated sample $G(z)$ fed into the discriminator for binary classification, whose output confidence $D(G(z))$ should be as close to $0$ as possible, i.e., $1 - D(G(z))$ should be as close to $1$ as possible.

From the generator's perspective, we want to minimize the maximum of the discriminator's objective. That maximum corresponds to the JS divergence between the real data distribution and the generated data distribution; JS divergence measures how similar two distributions are, and the closer they are, the smaller it is. (This JS-divergence formulation was proposed in the original GAN paper; it shows deficiencies in practical applications, and later papers have proposed many new loss functions to improve on it.)
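
As a minimal sketch of how this min-max game is trained in practice (PyTorch-style; G, D, their optimizers, and real_batch are assumed to exist, with D outputting probabilities of shape (batch, 1), and the generator update uses the common non-saturating form of the objective):

import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_step(G, D, opt_G, opt_D, real_batch, z_dim=100):
    b = real_batch.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # 1) Update the discriminator: push D(real) toward 1 and D(G(z)) toward 0
    z = torch.randn(b, z_dim)
    fake = G(z).detach()                     # detach so G is not updated in this step
    loss_D = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 2) Update the generator: push D(G(z)) toward 1
    z = torch.randn(b, z_dim)
    loss_G = bce(D(G(z)), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()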

[19] Classic GAN models that are often asked about in interviews

  1. The original GAN and its training logic
  2. DCGAN
  3. CGAN
  4. WGAN
  5. LSGAN
  6. Pix2Pix series
  7. CycleGAN
  8. SRGAN series

[20] Related knowledge of FPN (Feature Pyramid Network)

Innovation points of FPN

  1. Design feature pyramid structure
  2. Extract multi-layer features (bottom-up, top-down)
  3. Multi-layer feature fusion (lateral connection)

The feature pyramid structure is designed to address the multi-scale problem in object detection; it greatly improves small-object detection performance while adding almost no computation to the original model.

Previously, many object detection algorithms used only high-level features for prediction. High-level features carry rich semantic information but have low resolution and coarse object localization. Suppose that in a deep network a single pixel of the final high-level feature map corresponds to a $20 \times 20$ pixel region of the input image; then the features of a small object smaller than $20 \times 20$ pixels have most likely been lost. Meanwhile, low-level features carry less semantic information but localize objects accurately, which helps small-object detection. FPN fuses high-level and low-level features, thereby exploiting both the high resolution of low-level features and the rich semantics of high-level features, and makes independent predictions at multiple scales, significantly improving small-object detection.

FPN structure

Traditional ideas to solve this problem include:

  1. Image pyramid, i.e. multi-scale training and testing. However, this method is computationally intensive and time-consuming.
  2. Feature hierarchy, i.e., each layer outputs detection results at its own scale and resolution, as in the SSD algorithm. In practice, different depths correspond to different levels of semantic features: shallow layers have high resolution and learn detailed features, while deep layers have low resolution and learn semantic features, so using each level's features in isolation is not sufficient.

Main modules of FPN

  1. Bottom-up pathway
  2. Top-down path
  3. Lateral connections

Bottom-up pathway

The bottom-up line is the forward propagation process of the convolutional network. During forward propagation, the size of the feature map can change at some layers.

Top-down path (top-down line) and lateral connections (horizontal links)

The top-down pathway is an upsampling process, and the lateral connections fuse the top-down results with the bottom-up feature maps.

The upsampled feature map is fused with the bottom-up feature map of the same size by pixel-wise (element-wise) addition, where the bottom-up feature first passes through a $1\times1$ convolution layer whose purpose is to reduce the channel dimension.
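
A minimal PyTorch-style sketch of one top-down step with a lateral connection (the channel numbers are illustrative; the trailing 3x3 smoothing conv follows the FPN paper's description):

import torch.nn as nn
import torch.nn.functional as F

class FPNTopDownStep(nn.Module):
    # Fuse a higher-level (coarser) FPN feature with a lower-level backbone feature
    def __init__(self, c_low, c_fpn=256):
        super().__init__()
        self.lateral = nn.Conv2d(c_low, c_fpn, kernel_size=1)   # 1x1 conv reduces the channel dimension
        self.smooth = nn.Conv2d(c_fpn, c_fpn, kernel_size=3, padding=1)

    def forward(self, top, low):
        up = F.interpolate(top, size=low.shape[2:], mode="nearest")  # upsample the top-down feature
        fused = up + self.lateral(low)                               # element-wise addition
        return self.smooth(fused)                                    # 3x3 conv reduces aliasing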

FPN application

In the paper, FPN is directly improved on Faster R-CNN, and its backbone is ResNet101. FPN is mainly used in the two modules of RPN and Fast R-CNN in Faster R-CNN.

FPN+RPN:

After combining FPN with RPN, the input to RPN becomes multi-scale feature maps, and an RPN head is attached to each pyramid level of the output to perform anchor classification and regression.

FPN+Fast R-CNN:

The overall structural logic of Fast R-CNN remains unchanged, and the FPN idea is introduced in the backbone part for transformation.

[21] Related knowledge of SPP (Spatial Pyramid Pooling)

In object detection, many algorithms end with fully connected layers, which forces a fixed input size. When an image does not match that size, operations such as crop or warp are needed to fit the image to the network input. Both can cause problems: the cropped region may not contain the entire object, and warping introduces unwanted geometric distortion of the target.

What SPP does is add an SPP layer after the convolutional layers that pools the feature maps into a fixed-length feature vector, which is then fed into the fully connected layers. This resolves the awkward problem above.

The difference between crop/warp and SPP

Advantages of SPP:

  1. SPP produces a fixed-length output regardless of the input size.
  2. SPP pools with windows at multiple scales rather than a single sliding-window size.
  3. SPP extracts features from the feature map at different scales, increasing the richness of the extracted features.

SPP logic diagram
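
A minimal sketch of a spatial pyramid pooling layer (the pyramid levels below are illustrative; adaptive pooling is used here so any input size yields a fixed-length vector):

import torch
import torch.nn.functional as F

def spp(feature_map, levels=(1, 2, 4)):
    # feature_map: (B, C, H, W) with arbitrary H and W
    pooled = []
    for n in levels:
        # Pool to an n x n grid, then flatten: contributes C * n * n values
        p = F.adaptive_max_pool2d(feature_map, output_size=n)
        pooled.append(p.flatten(start_dim=1))
    return torch.cat(pooled, dim=1)   # fixed-length vector regardless of input H, W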

[22] The meaning of AP, AP50, AP75, mAP, and other metrics in object detection

AP: Area under the PR curve.

PR curve

AP50: AP value when fixed IoU is 50%.

AP75: AP value when fixed IoU is 75%.

AP@[0.5:0.95]: Divide the IoU value every 5% from 50% to 95%, and average these 10 sets of AP values.

mAP: Calculate AP for all categories and then take the average.

mAP@[.5:.95] (i.e., mAP@[.5,.95]): the mAP averaged over IoU thresholds from 0.5 to 0.95 with a step of 0.05 (0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95).

[23] How are anchors generated in YOLOv2?

YOLOv2 introduces the K-means algorithm to generate anchors, which automatically finds better anchor widths and heights for initializing model training.

However, if the Euclidean distance in classic K-means is used as a metric, it means that a larger Anchor will produce a larger error than a smaller Anchor, and the clustering results may deviate.

Since object detection mainly cares about the IoU between the anchor and the ground-truth box (gt box) rather than their absolute sizes, it is more appropriate to use IoU as the metric, i.e., to make the IoU as large as possible. YOLOv2 therefore uses the following IoU-based distance as the clustering criterion:

$d(gt\ box, anchor) = 1 - IoU(gt\ box, anchor)$

The specific anchor-generation steps are roughly the same as classic K-means (introduced in detail in the next chapter). The main differences are that the metric is the $d(gt\ box, anchor)$ defined above and that the anchors serve as the cluster centers.
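
A minimal sketch of the IoU-based distance used in this clustering (boxes are represented by width and height only, since location is irrelevant here; the full k-means loop is omitted):

import numpy as np

def iou_wh(gt_wh, anchor_wh):
    # IoU of two boxes aligned at a common corner, using only width and height
    inter = np.minimum(gt_wh[0], anchor_wh[0]) * np.minimum(gt_wh[1], anchor_wh[1])
    union = gt_wh[0] * gt_wh[1] + anchor_wh[0] * anchor_wh[1] - inter
    return inter / union

def kmeans_distance(gt_wh, anchor_wh):
    # d(gt box, anchor) = 1 - IoU(gt box, anchor)
    return 1.0 - iou_wh(gt_wh, anchor_wh)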


Source: blog.csdn.net/weixin_51390582/article/details/135173150