SLAM interview notes (8) — Computer vision interview questions

Table of contents

Question 1: Algorithm classification of target detection

Question 2: Composition of convolutional neural network

Question 3: The role of the input layer

Question 4: The role of convolutional layer 

Question 5: Convolution kernel type

Question 6: The role of 1×1 convolution kernel

Question 7: Is a larger convolution kernel always better?

Question 8: Checkerboard effect and solution

Question 9: How to reduce convolutional layer parameters

Question 10: Neural Network Visualization Tools

Question 11: The role of pooling layer 

Question 12: The difference between convolutional layer and pooling layer

Question 13: The role of activation function layer

Question 14: The role of the fully connected layer

Question 15: How to improve the generalization ability of convolutional neural networks

Question 16: Let’s talk about the normalization methods BN, LN, IN, and GN. 

Question 17: If the softmax input is multiplied by a coefficient a, how does the probability distribution change?

Question 18: How to solve the problem of imbalance between positive and negative samples 

Question 19: Reasons why the training network does not converge 

Question 20: Characteristics of the optimization algorithms Adam, Momentum, Adagrad, and SGD

Question 21: Reasons why small targets are difficult to detect

Question 22: Describe the YOLOv5 framework


Question 1: Algorithm classification of target detection

Target detection algorithms based on deep learning are mainly divided into two categories:

Two-stage target detection algorithms

  • Main idea: first generate region proposals (RP), then classify the candidate samples with a convolutional neural network.
  • Task route: feature extraction → region proposal generation → classification/localization regression.

One-stage target detection algorithms

  • Main idea: no region proposal stage; the network extracts features and directly predicts object classes and locations.
  • Task route: feature extraction → classification/localization regression.

Question 2: Composition of convolutional neural network

The basic structure of a convolutional neural network consists of the following parts: input layer, convolutional layers, pooling layers, activation function layers, and fully connected layers.

Question 3: The role of the input layer

In CNNs that process images, the input layer generally holds the pixel matrix of an image. A picture can be represented as a three-dimensional matrix: its length and width give the image size, while its depth gives the number of color channels. For example, a grayscale image has depth 1, while an RGB image has depth 3.

Question 4: The role of convolutional layer 

The core of a convolutional neural network is the convolutional layer, and the core of the convolutional layer is the convolution operation.

The convolution operation is the inner product (element-wise multiplication followed by summation) between a local window of the image (the data window) and a filter matrix (a set of fixed weights; because each neuron's weights are fixed, the filter can be regarded as a constant template). This operation is what gives the convolutional neural network its name.

In a CNN, the filter convolves local input data: after the data in one window have been processed, the window slides over and the computation repeats until all positions are covered. Three parameters govern this process:

  • Depth: the number of filters (neurons), which determines the number of output channels.
  • Stride: how many pixels the window slides at each step.
  • Zero-padding: rings of zeros added around the border so that the window can slide from the first to the last position in whole steps; in plain terms, padding makes the total length divisible by the stride (see the sketch below).
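The following is a minimal NumPy sketch of the sliding-window convolution just described, for a single channel and a single filter; the function and its toy inputs are illustrative, not a framework API.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Naive single-channel convolution (really cross-correlation, as in CNNs)."""
    image = np.pad(image, padding, mode="constant")  # zero-padding rings
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Inner product of the current data window and the filter.
            window = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(window * kernel)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
k = np.ones((3, 3)) / 9.0                        # a 3x3 mean filter
print(conv2d(x, k, stride=2, padding=1).shape)   # (3, 3)
```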

Question 5: Convolution kernel type

Transposed convolution

Sometimes we need to increase the spatial size of the input (also called "upsampling"). In transposed convolution, the original feature matrix is first padded to expand its size to match the target output size, and then an ordinary convolution is applied. Transposed convolution is commonly used to detect small targets in object detection and to restore the scale of the input image in image segmentation.

Dilated/Atrous convolution

A parameter called the dilation rate is introduced so that a convolution kernel of the same size covers a larger receptive field; equivalently, fewer parameters are needed than with an ordinary convolution of the same receptive field. With the same 3x3 kernel size, a dilated convolution can extract features over a 5x5 region, and it is widely used in real-time image segmentation.

Separable convolution

A standard convolution operates on the input in the spatial and channel directions at the same time: with K kernels of spatial size H×W on a C-channel input, it needs H×W×C×K parameters. If the spatial and channel directions are separated, the operation becomes two steps, first a convolution over the H×W (spatial) direction and then a convolution over the C (channel) direction; with the same K kernels, only (H×W + C)×K parameters give the same output scale. Separable convolution is usually used for model compression or in lightweight convolutional neural networks such as MobileNet and Xception.
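As a check on the parameter counts above, here is a small PyTorch sketch comparing a standard convolution with its depthwise-separable counterpart (the MobileNet-style factorization); the channel sizes are arbitrary examples.

```python
import torch.nn as nn

c_in, c_out, k = 64, 128, 3

standard = nn.Conv2d(c_in, c_out, k, padding=1, bias=False)

separable = nn.Sequential(
    # Depthwise step: one k x k filter per input channel (groups=c_in).
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in, bias=False),
    # Pointwise step: a 1x1 convolution mixes information across channels.
    nn.Conv2d(c_in, c_out, 1, bias=False),
)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(standard))   # 3*3*64*128 = 73728
print(n_params(separable))  # 3*3*64 + 64*128 = 8768
```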

Question 6: The role of 1×1 convolution kernel

The role of the 1×1 convolution kernel can be summarized as follows:

  • Increase network depth (add nonlinear mappings)
  • Increase/reduce dimensionality (channel count)
  • Exchange information across channels
  • Reduce convolution parameters (simplify the model); a minimal example is shown below
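A minimal PyTorch sketch of the dimensionality-reduction use (channel counts are illustrative): a 1x1 convolution changes only the channel dimension.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)             # (N, C, H, W)
reduce = nn.Conv2d(256, 64, kernel_size=1)  # 256 -> 64 channels
print(reduce(x).shape)                      # torch.Size([1, 64, 28, 28])
```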

Question 7: Is a larger convolution kernel always better?

A larger convolution kernel gives a larger receptive field, but it also increases the computation significantly, which hinders training deeper models and reduces computational efficiency. Later convolutional networks (VGG, GoogLeNet, etc.) found that stacking two 3×3 kernels achieves the same receptive field as one 5×5 kernel with fewer parameters (2×3×3 = 18 < 25 = 5×5), so the 3×3 kernel is now widely used in convolutional neural networks.
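A quick check of the arithmetic (biases ignored; C is an example channel count):

```python
C = 64
two_3x3 = 2 * 3 * 3 * C * C   # two stacked 3x3 layers: 73728 weights
one_5x5 = 5 * 5 * C * C       # one 5x5 layer: 102400 weights
print(two_3x3 < one_5x5)      # True, for the same 5x5 receptive field
```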

However, this does not mean that larger kernels are useless; in some domains they still help. When applying convolutional neural networks to natural language processing, networks are usually shallow stacks of convolutional layers, but text features sometimes need a wider receptive field so the model can combine more features (such as phrases and characters); a larger kernel is then the better choice.

To sum up, kernel size has no absolute advantage or disadvantage; it depends on the application scenario. Both extremes are inappropriate: a lone 1×1 kernel can serve as a convolution on its own but cannot effectively combine spatially neighboring input features, while a very large kernel usually combines too many meaningless features and wastes a lot of computation.

Question 8: Checkerboard effect and solution

When the filter size is not divisible by the convolution stride, transposed convolution overlaps unevenly: some parts of the output receive more contributions than others and appear darker or brighter, producing the checkerboard effect.

How to avoid and mitigate the checkerboard effect:

(1) Make sure the filter size is divisible by the convolution stride, so the overlap is uniform.

(2) Use a transposed convolution with stride 1 (for example, as a final smoothing layer) to reduce the checkerboard effect. Both options are sketched below.
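Both mitigations in PyTorch (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)

# (1) kernel_size 4 is divisible by stride 2, so the kernel overlap is uniform.
up = nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1)

# (2) a stride-1 transposed convolution after upsampling smooths artifacts.
smooth = nn.ConvTranspose2d(8, 8, kernel_size=3, stride=1, padding=1)

print(smooth(up(x)).shape)  # torch.Size([1, 8, 64, 64])
```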

Reference: Summary of convolution operations (3): causes of and solutions to the checkerboard effect of transposed convolution (Zhihu).

Question 9: How to reduce convolutional layer parameters

  • Use stacked small kernels instead of a large kernel: two 3×3 kernels (as in VGG) can replace one 5×5 kernel.
  • Use separable convolution: split the original K×K×C convolution into a K×K×1 step and a 1×1×C step.
  • Add a 1×1 convolution: similar to separable convolution, but with a variable channel count; a 1×1×C2 kernel is added before the K×K×C convolution.
  • Pool before the convolutional layer: pooling reduces the input feature dimensions of the convolutional layer.

Question 10: Neural Network Visualization Tools

Visualization tools for neural networks include Netron, draw_convnet, NNSVG, PlotNeuralNet, Tensorboard, Caffe, etc.

Reference: [Deep Learning | Machine Learning] 12 amazing neural network visualization tools, summarized in nearly 10,000 words (CSDN blog).

Question 11: The role of pooling layer 

The pooling layer is also called the downsampling layer. It filters the features within the receptive field and keeps the most representative one for the region, which effectively reduces the output feature size and hence the number of parameters the model needs. The main kinds are average pooling and max pooling. Simply put, pooling picks one value to represent an entire region. The hyperparameters of a pooling layer are the pooling window size and the pooling stride; the pooling operation can also be regarded as a convolution. Both kinds are sketched below.
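A minimal PyTorch sketch: with a 2x2 window and stride 2, both poolings halve the spatial size and keep the channel count unchanged.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 32, 28, 28)
print(nn.MaxPool2d(kernel_size=2, stride=2)(x).shape)  # [1, 32, 14, 14]
print(nn.AvgPool2d(kernel_size=2, stride=2)(x).shape)  # [1, 32, 14, 14]
```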

Question 12: The difference between convolutional layer and pooling layer

The convolutional layer and the pooling layer have certain similarities in structure. They both extract features within the receptive field and obtain outputs of different dimensions according to the step size settings. However, their internal operations are essentially different.
  • Structure: with zero-padding, a convolutional layer can keep the spatial size unchanged while changing the number of channels; a pooling layer usually reduces the spatial size and keeps the number of channels unchanged.
  • Stability: a small change in the input changes a convolutional layer's output; a subtle change within the receptive field does not affect a pooling layer's output.
  • Effect: a convolutional layer extracts locally correlated features within the receptive field; a pooling layer extracts generalized features within the receptive field and reduces the dimensionality.
  • Parameters: a convolutional layer's parameter count depends on kernel size and kernel count; a pooling layer introduces no extra parameters.

Question 13: The role of the activation function layer

The activation function layer applies a nonlinear activation function: if the activation function were linear, the output would remain a linear function of the input, whereas a nonlinear activation function gives the network nonlinear expressive power. Common activation functions include Sigmoid, tanh, and ReLU. ReLU is the usual choice in convolutional neural networks; it provides a very simple nonlinear transform, ReLU(x) = max(0, x).
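Written out directly as a NumPy sketch (framework-independent):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
tanh = np.tanh
relu = lambda x: np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))  # [0.  0.  0.  0.5 2. ]: negative inputs are zeroed
```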

Question 14: The role of the fully connected layer

After several rounds of convolutional and pooling layers, a CNN usually ends with one or two fully connected layers that give the final classification result. By that point, the information in the image has been abstracted into features with higher information content. The convolutional and pooling layers can be seen as automatic image feature extraction; once extraction is complete, a fully connected layer is still needed to complete the classification task.

Question 15: How to improve the generalization ability of convolutional neural networks

  • Use more data: labeling more training data is the most direct way to improve generalization; with more data the model learns more completely, and generalization improves naturally.
  • Use a larger batch_size: with the same number of iterations and learning rate, more data per batch helps the model learn a better fit, and its outputs become more stable.
  • Oversample the data: the data we get are often class-imbalanced, so the model overfits the over-represented class and its outputs are biased toward it. Oversampling the other classes to balance the data volume can improve generalization to some extent.
  • Data augmentation: when data are limited, geometric transformations of the images let similar data appear in richer forms, improving the model's generalization ability.
  • Modify the loss function: there is much work in this area, such as Focal Loss, GHM Loss, and IoU Loss in object detection, all aimed at improving generalization.
  • Modify the network: if the network is too shallow and has too few parameters, the model underfits; the usual fix is to stack more convolutional layers to increase the parameters and the feature-extraction capacity. If the network is too deep and the training data are relatively few, the model easily overfits; then the network should be simplified by reducing the number of layers, or by using ResNet's residual structure and BN layers.
  • Weight penalty: a regularization operation; a regularization term on the weight matrix is added to the loss function as a penalty, since overly large network weights usually mean the network has overfit the training samples.
  • Dropout strategy: if the network ends with fully connected layers, the Dropout strategy can be used; it acts roughly like an ensemble of sub-models and helps generalization. The last two points are sketched below.
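A PyTorch sketch of the last two points (layer sizes and hyperparameters are illustrative): the weight penalty appears as the optimizer's weight_decay argument, and Dropout sits before the final fully connected layer.

```python
import torch.nn as nn
import torch.optim as optim

head = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes activations during training
    nn.Linear(256, 10),
)
# weight_decay adds an L2 penalty on the weights to the loss.
opt = optim.SGD(head.parameters(), lr=0.01, weight_decay=1e-4)
```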

Question 16: Let’s talk about the normalization methods BN, LN, IN, and GN. 

BN

  • Batch Normalization assumes features are identically distributed across the batch and across the H and W dimensions, so the mean and variance of each channel are computed over N, H, W; the parameter count is 2C.
  • Its drawback is sensitivity to the data distribution within the batch: if batch_size is small, the computed mean and variance are not representative. It is also unsuitable for sequence models, where sample lengths usually differ, and it does not apply when the training and test data distributions differ.

LN

  • Layer Normalization is independent of batch size: the number of samples does not affect the amount of data involved in the computation, which solves the two problems of BN above.
  • The drawback is that where both BN and LN are usable, BN is generally better: statistics computed over different samples for the same feature are less likely to lose information than statistics computed within one sample.

IN

  • Instance Normalization normalizes each H×W plane separately, unaffected by channels or batch_size. It is often used in style transfer, since it captures the statistics of each sample's individual channels.
  • The drawback: if the feature map can exploit correlations between channels, IN is not recommended.

GN

  • Group Normalization first divides the channels into groups: the feature dimensions are reshaped from [N, C, H, W] to [N, G, C//G, H, W], and each group is normalized over the dimensions [C//G, H, W].
  • GN avoids the batch-size dependence of BN, and grouping the channels for normalization also mitigates Internal Covariate Shift, giving good results. The sketch below shows the four layers side by side.
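A side-by-side PyTorch sketch on an [N, C, H, W] feature map (sizes are illustrative); the comments note which axes each layer computes statistics over.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 32, 14, 14)  # N=8, C=32

bn = nn.BatchNorm2d(32)            # per channel, over (N, H, W)
ln = nn.LayerNorm([32, 14, 14])    # per sample, over (C, H, W)
inorm = nn.InstanceNorm2d(32)      # per sample and channel, over (H, W)
gn = nn.GroupNorm(num_groups=8, num_channels=32)  # per group of C//G channels

for layer in (bn, ln, inorm, gn):
    print(layer(x).shape)          # all torch.Size([8, 32, 14, 14])
```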

Question 17: If the softmax input is multiplied by a coefficient a, how does the probability distribution change?

When a > 1, the distribution becomes steeper (more peaked); when a < 1, it becomes smoother (closer to uniform). The numeric check below illustrates this.
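A quick numeric check (NumPy sketch; the logits are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.5])
print(softmax(z))        # [0.629 0.231 0.140]  original
print(softmax(3.0 * z))  # [0.943 0.047 0.010]  steeper  (a > 1)
print(softmax(0.3 * z))  # [0.420 0.311 0.268]  smoother (a < 1)
```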

Question 18: How to solve the problem of imbalance between positive and negative samples

  • Oversampling: oversample the classes with few samples (minority classes) in the training set to alleviate the class imbalance.
  • Undersampling: undersample the classes with many samples (majority classes), discarding some samples, to alleviate the class imbalance.
  • Synthesize new minority-class samples. A resampling sketch follows.
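One common way to implement oversampling is PyTorch's WeightedRandomSampler, sketched below under an assumed 9:1 binary imbalance; each sample is weighted by the inverse of its class frequency.

```python
import torch
from torch.utils.data import WeightedRandomSampler

labels = torch.tensor([0] * 900 + [1] * 100)   # 9:1 imbalance
class_count = torch.bincount(labels).float()   # tensor([900., 100.])
weights = 1.0 / class_count[labels]            # per-sample weights
sampler = WeightedRandomSampler(weights, num_samples=len(labels),
                                replacement=True)
# Pass sampler=sampler to a DataLoader to draw roughly balanced batches.
```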

Question 19: Reasons why the training network does not converge 

Reasons for data processing

  • Data were not normalized;
  • Data were not preprocessed;
  • No regularization was used;

Reason for parameter setting

  • Batch Size is set too large;
  • The learning rate is set inappropriately;

Network settings reasons

  • The network has bad gradients: for example, ReLU's gradient is 0 for negative inputs, and during backpropagation a zero gradient means nothing propagates;
  • Parameter initialization error;
  • The network settings are unreasonable and the network is too shallow or too deep;

Question 20: Characteristics of the optimization algorithms Adam, Momentum, Adagrad, and SGD

  • Adagrad adapts the learning rate automatically during training given one global learning rate: the effective learning rate of each parameter is inversely proportional to the square root of the sum of its squared historical gradients. Because the accumulated squared gradients only grow, the effective learning rate is comparatively large early in training and shrinks steadily as training progresses.
  • Momentum borrows the physical notion of momentum: gradients from earlier rounds participate in the current update, with a decay applied to the older contributions. It mitigates the instability of plain gradient descent and the tendency to get stuck at saddle points.
  • SGD is stochastic gradient descent: each iteration computes the gradient on a mini-batch of the dataset and updates the parameters. Updates are fast, but training is less stable and the accuracy fluctuates.
  • Adam uses first- and second-moment estimates of the gradient to adapt each parameter's learning rate; after bias correction, every iteration's learning rate stays within a bounded range, keeping the parameters stable. It combines the advantages of Momentum and Adagrad. The snippet below shows the PyTorch instantiations.
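For reference, the four optimizers as instantiated in PyTorch (hyperparameters are illustrative defaults, not recommendations):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
sgd = optim.SGD(model.parameters(), lr=0.01)
momentum = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
adagrad = optim.Adagrad(model.parameters(), lr=0.01)
adam = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```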

Question 21: Reasons why small targets are difficult to detect

Small targets occupy few pixels in the original image. The common backbone networks (the VGG and ResNet families) downsample the input several times, so in the deepest feature maps a small target may span only a handful of pixels, and the detection classifier built on those features performs poorly on small targets.

The number of small targets in the original image is small, and the detector extracts fewer features, resulting in poor detection effect of small targets.

Neural networks are dominated by large targets in learning, and small targets are ignored throughout the learning process, resulting in poor detection results for small targets.

Tricks
(1) Data augmentation. Simple and crude: enlarge the image, detect at multiple scales with an image pyramid, and fuse the results. The drawbacks are operational complexity and heavy computation, impractical in real deployments.
(2) Feature fusion methods: FPN, multi-scale feature map prediction; the feature stride can start smaller.
(3) Suitable training schemes: SNIP and SNIPER (CVPR 2018).
(4) Set smaller and denser anchors; if regression to the presets falls short, design the anchor matching strategy (see S3FD).
(5) Use a GAN to enlarge small objects before detecting them (there is such a CVPR 2018 paper).
(6) Use context information to build connections between objects and their context, as in Relation Networks.
(7) Under dense occlusion, improve localization and classification together; see IoU loss, repulsion loss, etc.
(8) When designing the network, prefer stride 1 where possible to retain as many target features as possible.
(9) Matching strategy: do not set an overly strict IoU threshold for small objects, or borrow the ideas of Cascade R-CNN.

Question 22: Describe the YOLOv5 framework

1 Network structure

The network structure of YOLOv5 mainly consists of the following parts:

  • Backbone: New CSP-Darknet53
  • Neck: SPPF, New CSP-PAN
  • Head: YOLOv3 Head

The overall network structure (drawn from yolov5l.yaml in the original post) is the same across the different sizes (n, s, m, l, x); only the depth and width of each sub-module differ. Note also that besides n, s, m, l, and x there are n6, s6, m6, l6, and x6 versions for larger input resolutions; the difference is that the latter downsample 64 times and use 4 prediction feature layers, while the former downsample 32 times and use 3 prediction feature layers.

Compared with YOLOv4, the Backbone of YOLOv5 has not changed much. However, since version v6.0 there is one small change: the first layer of the network (originally the Focus module) was replaced by a 6x6 convolutional layer; the two are theoretically equivalent. The Focus module (similar to the Patch Merging step in Swin Transformer) splits every 2x2 group of adjacent pixels into a patch, gathers the pixels at the same position in each patch into four feature maps, and then applies a 3x3 convolutional layer. This is equivalent to using a single 6x6 convolutional layer directly.

The Neck changed more. First, SPP was replaced by SPPF; the two have the same function, but the latter is more efficient. In SPP, the input passes through several MaxPools of different sizes in parallel, and the results are then fused, which mitigates the target multi-scale problem to some degree.

The SPPF structure instead passes the input through several 5x5 MaxPool layers in series. Note that two serial 5x5 MaxPool layers compute the same result as one 9x9 MaxPool layer, and three serial 5x5 MaxPool layers compute the same result as one 13x13 MaxPool layer. A sketch of the module follows.
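A simplified sketch of the SPPF idea, assuming plain Conv2d+SiLU stand-ins for YOLOv5's Conv blocks (the real ones also include BatchNorm); channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_mid = c_in // 2
        self.cv1 = nn.Sequential(nn.Conv2d(c_in, c_mid, 1), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c_mid * 4, c_out, 1), nn.SiLU())
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)    # one 5x5 pool
        y2 = self.pool(y1)   # two serial 5x5 pools ~ one 9x9 pool
        y3 = self.pool(y2)   # three serial 5x5 pools ~ one 13x13 pool
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

print(SPPF(256, 256)(torch.randn(1, 256, 20, 20)).shape)  # [1, 256, 20, 20]
```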

2 Data enhancement

In the YOLOv5 code, there are quite a lot of data enhancement strategies. Here is a brief list of some methods:

  • Mosaic, combine four pictures into one picture

  • Copy-paste: randomly paste some targets into the picture; this requires the data to have segmentation annotations (instance masks) for each target.

  • MixUp: fuse two pictures together with a certain transparency. Whether it helps is unclear; there is no paper or ablation experiment for it. In the code, MixUp is used only for the larger models, and only with 10% probability each time.
  • Albumentations: mainly filtering, histogram equalization, image-quality changes, etc.

  • Augment HSV (Hue, Saturation, Value): randomly adjust hue, saturation, and value.

  • Random horizontal flip: flip the image horizontally at random.

3 Training strategies

Many training strategies are used in the YOLOv5 source code. Here is a brief summary of the ones I noticed; there may be others, so check the source code yourself:

  • Multi-scale training (0.5~1.5x): if the input size is set to 640×640, the training size is sampled randomly between 0.5×640 and 1.5×640, always rounded to an integer multiple of 32 (the network downsamples by up to 32).
  • AutoAnchor (for training custom data): when training on your own dataset, anchor templates can be re-clustered from the targets in your data.
  • Warmup and cosine LR scheduler: warm up the learning rate before training, then decay it with a cosine schedule.
  • EMA (Exponential Moving Average): roughly a momentum term on the trained parameters that smooths their updates.
  • Mixed precision: mixed-precision training reduces memory usage and speeds up training, provided the GPU hardware supports it.
  • Evolve hyper-parameters: hyperparameter search; without tuning experience, it is best to keep the defaults.

4 Loss calculation

The loss of YOLOv5 mainly consists of three parts:

  • Classes loss: classification loss, using BCE loss; computed only on positive samples.
  • Objectness loss: obj loss, also using BCE loss; here the obj target is the CIoU between the bounding box predicted by the network and the GT box, and the loss is computed over all samples.
  • Location loss: localization loss, using CIoU loss; computed only on positive samples. A sketch of the three-part composition follows.
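A loose sketch of the three-part composition, assuming torchvision's complete_box_iou_loss (present in recent torchvision releases) as the CIoU term; the tensor shapes and the equal weighting are illustrative, not YOLOv5's actual code.

```python
import torch
import torch.nn as nn
from torchvision.ops import complete_box_iou_loss

bce = nn.BCEWithLogitsLoss()

cls_logits, cls_targets = torch.randn(8, 80), torch.rand(8, 80)  # positives
obj_logits, obj_targets = torch.randn(64), torch.rand(64)        # all samples
pred_boxes = torch.tensor([[10., 10., 50., 50.]])  # (x1, y1, x2, y2)
gt_boxes = torch.tensor([[12., 8., 48., 52.]])

loss = (bce(cls_logits, cls_targets)
        + bce(obj_logits, obj_targets)
        + complete_box_iou_loss(pred_boxes, gt_boxes, reduction="mean"))
```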
