Compilation of computer vision interview questions

1. Introduce the principles of the YOLO series and SSD series of object detection networks. Why is YOLO not good at detecting small targets, and how can this be improved besides making the anchors smaller?

  • YOLO object detection: YOLO is a real-time object detection algorithm. Its core idea is to treat detection as a regression problem and predict object categories and locations directly from the input image. Its main features are: ① Single forward pass (one-stage): YOLO completes the whole detection task in a single forward pass, without a separate region-proposal stage or other multi-step processing. ② Grid division: the input image is divided into a fixed number of grid cells, and each cell is responsible for predicting the location and category of the objects it contains. ③ Multi-scale prediction: YOLO uses multi-scale anchor boxes to handle objects of different sizes and aspect ratios, which helps improve detection performance. ④ Loss function: YOLO uses a multi-part loss that measures classification error and localization error, encouraging the model to predict bounding boxes that locate objects accurately. ⑤ Real-time performance: thanks to the single forward pass and its efficient design, YOLO performs well in real-time detection tasks.
  • SSD (Single Shot MultiBox Detector): unlike YOLO, SSD handles objects of different sizes with multi-level feature extraction. ① Multi-level feature extraction: SSD extracts image features with multiple convolutional layers and performs detection at several feature levels. ② Anchor boxes: SSD uses anchor boxes of different sizes and aspect ratios; each anchor is tied to a specific location and scale, and the model predicts object locations and categories relative to these anchors. ③ Loss function: like YOLO, SSD uses a multi-part loss combining classification error and localization error to optimize the model. ④ Efficiency: SSD strikes a good balance between speed and accuracy; although it may not be as fast as some YOLO versions, its detection accuracy is strong.
  • Why YOLO is not good at small targets: it was originally designed for relatively large objects, and small objects lose most of their features after repeated downsampling. Possible remedies: ① Increase the input image resolution, which usually improves small-target detection. ② Perform detection at multiple scales. ③ Use a feature pyramid to fuse high-resolution shallow features with deep semantic features. ④ Introduce an attention mechanism to help the network focus on small targets. ⑤ Use data augmentation such as random cropping, scaling, and rotation to generate more small-target samples, helping the network learn to recognize and locate them. ⑥ Set smaller and denser anchors, and design the convolutional layers to retain as many small-target features as possible (a small anchor-generation sketch follows below).
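
A small numpy sketch of the denser, smaller anchor idea from point ⑥; the base sizes and aspect ratios here are illustrative, not any YOLO version's actual anchor configuration:

```python
import numpy as np

def make_anchors(base_sizes, aspect_ratios):
    """Generate (w, h) anchor shapes for every size/ratio combination.
    Adding small base sizes and more ratios yields the smaller, denser
    anchor set suggested above."""
    anchors = []
    for s in base_sizes:
        for r in aspect_ratios:
            w = s * np.sqrt(r)   # width grows with the aspect ratio
            h = s / np.sqrt(r)   # height shrinks so that w * h == s * s
            anchors.append((w, h))
    return np.array(anchors)

# Base sizes 8 and 16 bias the anchor set toward small objects.
print(make_anchors(base_sizes=[8, 16, 32, 64], aspect_ratios=[0.5, 1.0, 2.0]))
```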

2. How to solve the imbalance between positive and negative samples in the sample?

  • Use a class-balanced (weighted) cross-entropy loss function (a minimal sketch follows after this list)
  • Perform data augmentation on the minority-class samples
  • Resampling (oversampling, increasing the number of minority class samples; undersampling, reducing the number of majority class samples)
  • Weight adjustment: modify the loss function so that samples of different classes receive different weights.
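
A minimal sketch of the class-weighted cross-entropy idea in PyTorch; the class counts are made up, and the inverse-frequency weighting shown is just one common choice:

```python
import torch
import torch.nn as nn

# Hypothetical imbalance: 900 negative samples vs. 100 positive samples.
num_neg, num_pos = 900, 100
total = num_neg + num_pos

# Weight each class inversely to its frequency so the minority class
# contributes more to the loss.
weights = torch.tensor([total / (2 * num_neg), total / (2 * num_pos)])

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 2)            # a batch of 8 predictions over 2 classes
labels = torch.randint(0, 2, (8,))    # ground-truth class indices
print(criterion(logits, labels).item())
```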

3. Briefly introduce the principle of support vector machine SVM.

A support vector machine is a binary classification model. Its basic form is a linear classifier that achieves the maximum margin in feature space; its learning strategy is margin maximization, which can ultimately be transformed into solving a convex quadratic programming problem.
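
As a concrete form of that convex quadratic program, the standard hard-margin primal problem for linearly separable training data $(x_i, y_i)$ with labels $y_i \in \{-1, +1\}$ is

$$
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1,\quad i = 1,\dots,n,
$$

where the margin being maximized is $2/\lVert w \rVert$; adding slack variables $\xi_i \ge 0$ with a penalty term $C\sum_i \xi_i$ gives the soft-margin version used for non-separable data.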

4. Which machine learning algorithms do not require normalization?

  • Models that do need normalization: distance-based models such as KNN, and models solved by gradient descent (linear regression, logistic regression, support vector machines, neural networks); a minimal scaling example follows after this list.
  • Tree models do not require normalization, because they do not depend on the numeric values of features but on how the features are distributed and on the conditional probabilities between variables, e.g. decision trees and random forests.
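
A minimal sklearn sketch of standardizing features before a distance-based model; the Iris dataset and the choice of k are purely illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize to zero mean / unit variance before the distance-based KNN.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X, y)
print(model.score(X, y))
```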

5. Why does the tree structure not need normalization?

Because numerical scaling does not change the position of split points, it has no effect on the structure of a tree model: if samples are sorted by a feature's values, a monotonic rescaling leaves the order unchanged, so the branches and split points stay the same. Moreover, tree models do not use gradient descent: when building a tree (e.g. a regression tree), the optimum is found by searching for the best split point, and the resulting step function is non-differentiable at the split points, so derivatives are meaningless and normalization is unnecessary.
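
A small sketch of this scale invariance (assuming sklearn): multiplying a feature by a constant changes the split thresholds but, up to tie-breaking, not the tree's structure or predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The same data, but with the first feature scaled by 1000.
X_scaled = X.copy()
X_scaled[:, 0] *= 1000

tree_a = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_b = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# Thresholds differ by the scale factor, but the predictions agree.
print(np.array_equal(tree_a.predict(X), tree_b.predict(X_scaled)))  # True
```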

6. In k-means or KNN, we often use Euclidean distance to calculate the distance between nearest neighbors. Sometimes we also use Manhattan distance. Please compare the difference between these two distances.

  • Euclidean distance: the most common distance measure, also known as straight-line distance; it measures the length of the straight segment between two points, just like the straight-line distance on a 2D plane. Because the per-dimension differences are squared before being summed, large differences in any single dimension dominate, so it is most suitable when the features in each dimension are on similar scales.
  • Manhattan distance: the sum of the absolute differences along each coordinate axis, hence also called city-block distance; it corresponds to the distance walked along the streets of a grid-like city. Each dimension is accumulated independently, without squaring, so no single dimension dominates the result.
  • Summary and comparison: ① Euclidean distance usually performs better when the scales of the dimensions are similar, because it blends the per-dimension differences into a single straight-line length. ② Manhattan distance is often more suitable when the dimensions have different scales or the data shows an obvious block/grid structure, because it simply accumulates distance along each coordinate axis. ③ Which measure to choose usually depends on the nature of the problem and the characteristics of the data; in some cases a custom distance may capture similarity or dissimilarity better (a quick computation of both distances follows below).
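
A quick numpy check of both distances for the same pair of points:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # city-block distance

print(euclidean, manhattan)   # ~3.606 and 5.0
```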

7. Reasons why CNN performs well on images

Taking image data directly as input removes the need for manual pre-processing and hand-crafted feature extraction; with its own fine-grained, learned feature extraction, a CNN can bring image processing close to human-level performance.

8. Calculation of parameters and calculation amounts

Suppose the convolution input is W x H x C, the kernel is K x K and there are N kernels (so the number of output channels C1 = N), and the output is W1 x H1 x C1.

  • Computation (multiply-accumulate operations): W1 x H1 x C1 x K x K x C
  • Parameters (ignoring bias): C1 x K x K x C — a small helper computing both follows below.
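
A small helper computing both quantities for a given layer; it counts multiply-accumulate operations and takes the output size W1 x H1 as given (padding and stride are assumed to be already accounted for):

```python
def conv_cost(W1, H1, C1, K, C, bias=False):
    """Parameter count and MAC count for a K x K convolution
    mapping C input channels to C1 output channels with a W1 x H1 output."""
    params = C1 * K * K * C + (C1 if bias else 0)
    macs = W1 * H1 * C1 * K * K * C
    return params, macs

# Example: a 3x3 convolution from 64 to 128 channels with a 56x56 output map.
print(conv_cost(W1=56, H1=56, C1=128, K=3, C=64))  # (73728, 231211008)
```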

9. Experience in adjusting parameters and modifying models

  • Data level: obtain more data, augment or generate data, normalize or standardize the data, and re-select features.
  • Algorithm level: spot-check a range of algorithms, select the one that performs best, and then improve it through further parameter tuning and data preparation (including the resampling method). Model selection and parameter tuning can be done on a small dataset first, and the final method then extended to the entire dataset.
  • Parameter tuning: ① Diagnostics: at each epoch, evaluate the model's performance on the training set and validation set and plot the curves. ② Weight initialization: try different initialization schemes and check whether one works better with all other conditions unchanged. ③ Learning rate: try a learning rate that decays over epochs, or add a momentum term. ④ Activation function: try the common activation functions, and rescale your data to match the input range of the chosen activation. ⑤ Batch size and epochs: try different batch sizes and numbers of epochs; the batch size determines the gradient estimate and how often the weights are updated. ⑥ Regularization: try different regularization methods, e.g. weight decay to penalize large weights, activation constraints to penalize large activation values, dropout tested on the input, hidden, and output layers respectively, or L1/L2 regularization. ⑦ Optimization algorithm and loss function: try different optimizers (SGD, Adam, RMSprop, ...); the loss function being optimized is highly problem-dependent and must be chosen accordingly. ⑧ Early stopping: once validation performance starts to degrade during training, stop training; this is a regularization method that avoids overfitting the training data (a minimal sketch follows after this list).
  • Improve performance with model ensembles: obtain strong predictive power by combining the predictions of several "good enough" models, rather than combining several highly tuned (and therefore brittle) models.
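
A minimal early-stopping loop as described in point ⑧; `train_one_epoch` and `evaluate` are hypothetical stand-ins for your own training and validation routines, and the weight snapshot assumes a PyTorch-style `state_dict`:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate,
                              max_epochs=100, patience=5):
    """Stop once validation loss has not improved for `patience` epochs in a row."""
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)        # hypothetical: one pass over the training set
        val_loss = evaluate(model)    # hypothetical: loss on the validation set
        if val_loss < best_loss:
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())  # snapshot best weights
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)  # roll back to the best weights seen
    return best_loss
```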

10. Briefly describe the differences and improvements of Inception v1-v4.

  • v1: ① Convolution kernels of different sizes give receptive fields of different sizes, and concatenating the outputs fuses features at different scales. ② The commonly used convolutions (1x1, 3x3, 5x5) and pooling (3x3) of CNNs are stacked in parallel (their outputs have the same spatial size and are concatenated along the channel dimension), which increases the width of the network on the one hand and its adaptability to scale on the other. ③ To reduce the amount of computation, 1x1 convolutions are added before the larger kernels.
  • v2: ① Convolution factorization: a single 5x5 convolutional layer is replaced by a small network of two consecutive 3x3 convolutional layers, which reduces the number of parameters while keeping the same receptive field and also deepens the network. ② Proposed the well-known Batch Normalization (BN) method: BN standardizes each mini-batch internally so that layer outputs are normalized towards N(0, 1), which speeds up training and allows a larger learning rate. ③ BN also acts as a regularizer to some extent, so dropout can be reduced or removed and the network structure simplified. In training, v2 reaches v1's accuracy about 14 times faster, and its final converged accuracy is also higher than v1's.
  • v3: ① Asymmetric (n x 1) convolutions: a larger two-dimensional convolution is split into two smaller one-dimensional convolutions (7x7 into 7x1 and 1x7, 3x3 into 1x3 and 3x1). This saves a large number of parameters, speeds up computation, and reduces overfitting, while further deepening the network and adding nonlinearity. ② The structure of the Inception module was also optimized.
  • v4: Use residual structure (Residual Connection) to improve the v3 structure

11. How is the inception structure in Inception v1 designed?

  • Using convolution kernels of different sizes means receptive fields of different sizes, and the final splicing means the fusion of features of different scales.
  • This structure stacks the convolutions (1x1, 3x3, 5x5) and pooling operations (3x3) commonly used in CNN (the sizes after convolution and pooling are the same, and the channels are added). On the one hand, it increases the width of the network , on the other hand, it also increases the adaptability of the network to scale.

  • However, in this original version of the Inception module, every convolution kernel is applied to all output channels of the previous layer; the 5x5 convolution alone requires roughly 120 million operations, and the concatenated feature maps become very thick (a very large number of channels).
  • To avoid this, a 1x1 convolution is added before the 3x3 and 5x5 convolutions and after max pooling to reduce the channel depth of the feature maps. This yields the Inception v1 module (a minimal sketch of such a block follows below).
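
A minimal PyTorch sketch of an Inception-v1-style block with the 1x1 reductions described above; the channel counts are illustrative rather than the exact GoogLeNet configuration:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch, c1x1, c3x3_red, c3x3, c5x5_red, c5x5, pool_proj):
        super().__init__()
        # Branch 1: plain 1x1 convolution.
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1x1, 1), nn.ReLU(inplace=True))
        # Branch 2: 1x1 reduction, then 3x3 convolution.
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3x3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3x3_red, c3x3, 3, padding=1), nn.ReLU(inplace=True))
        # Branch 3: 1x1 reduction, then 5x5 convolution.
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5x5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5x5_red, c5x5, 5, padding=2), nn.ReLU(inplace=True))
        # Branch 4: 3x3 max pooling, then 1x1 projection.
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Every branch preserves the spatial size, so outputs concatenate on channels.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

block = InceptionBlock(192, 64, 96, 128, 16, 32, 32)
print(block(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```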

12. Why does Inception use 1x1 convolution kernel?

  • The main purpose of the 1x1 convolution is dimensionality reduction, combined with a rectified linear activation (ReLU). For example, if the previous layer's output is 100x100x128 and it is fed through a 5x5 convolutional layer with 256 channels (stride=1, pad=2), the output is 100x100x256 and the convolution has 128x5x5x256 = 819200 parameters. If instead the previous layer's output first passes through a 1x1 convolutional layer with 32 channels and then through a 5x5 convolutional layer with 256 outputs, the output is still 100x100x256, but the number of convolution parameters drops to 128x1x1x32 + 32x5x5x256 = 208896, roughly a fourfold reduction (the arithmetic is checked in the snippet below).
  • It also deepens the network and increases its nonlinearity.
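
A quick check of the parameter arithmetic above (weights only, biases ignored):

```python
# Direct 5x5 convolution: 128 -> 256 channels.
direct = 128 * 5 * 5 * 256
# 1x1 bottleneck down to 32 channels, then 5x5 convolution up to 256 channels.
bottleneck = 128 * 1 * 1 * 32 + 32 * 5 * 5 * 256
print(direct, bottleneck, round(direct / bottleneck, 2))  # 819200 208896 3.92
```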

13. Evolution of CNN network

  • LeNet: 2 convolutions + 3 fully connected, first used for digital recognition.
  • AlexNet: 2012 ImageNet champion; 5 convolutional layers plus 3 fully connected layers; multiple smaller convolutions instead of a single very large one; the ReLU activation function to alleviate the vanishing-gradient problem; dropout to avoid overfitting; max pooling.
  • ZF-Net: 2013 ImageNet champion; uses a densely connected structure on a single GPU (rather than AlexNet's two-GPU split); changes AlexNet's first-layer convolution kernel from 11x11 to 7x7 and its stride from 4 to 2.
  • VGG-Nets: runner-up in the 2014 ImageNet classification task; a deeper network whose convolutional layers use smaller kernel sizes and strides; stacking several small convolutions gives the network more nonlinearity with fewer parameters.
  • GoogLeNet: winner of the 2014 ImageNet classification task; introduces the Inception module, where kernels of different sizes mean receptive fields of different sizes and the final concatenation fuses features of different scales; uses average pooling instead of a fully connected layer; and, to counter vanishing gradients, adds two auxiliary softmax classifiers that help propagate gradients to the earlier layers.
  • ResNet: introduces the residual unit, which simplifies the learning objective and difficulty, speeds up training, and prevents degradation as the model gets deeper; it also effectively mitigates vanishing and exploding gradients during training (a minimal residual block sketch follows after this list).
  • DenseNet : dense connection; strengthens feature propagation, encourages feature reuse, and greatly reduces the amount of parameters.
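
A minimal PyTorch sketch of the residual unit mentioned for ResNet: a basic block with an identity shortcut, with illustrative channel counts:

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """y = ReLU(F(x) + x): the shortcut lets gradients flow past the conv layers."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # identity shortcut

print(BasicResidualBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # [1, 64, 32, 32]
```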

14. Introduction to CNN, each layer and its function

  • The feature-detection layers of a CNN learn from training data, so explicit hand-crafted feature extraction is avoided: features are learned implicitly from the data. Because neurons on the same feature map share the same weights, the network can learn in parallel, which is a major advantage of convolutional networks over fully connected networks. With its special structure of local connections and shared weights, a convolutional neural network has unique advantages in speech recognition and image processing: weight sharing reduces the complexity of the network, and multi-dimensional inputs such as images can be fed into the network directly, avoiding the complexity of data reconstruction during feature extraction and classification.
  • Convolutional networks mainly consist of convolutional layers, activation functions, pooling layers, and fully connected layers (a minimal wiring of the four follows below). ① Convolution layer: uses convolution kernels for feature extraction and feature mapping. ② Activation function: since convolution is a linear operation, a nonlinear mapping must be added. ③ Pooling layer: compresses the input feature map, which on the one hand makes it smaller and simplifies the network's computation, and on the other hand distills the main features. ④ Fully connected layer (FC): connects all the features and sends the output to the classifier. (To be continued.)
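
A minimal PyTorch sketch wiring the four layer types together; the sizes assume 32x32 RGB inputs and 10 classes, purely for illustration:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1),   # convolution: feature extraction
            nn.ReLU(),                        # activation: add nonlinearity
            nn.MaxPool2d(2),                  # pooling: downsample, keep main features
            nn.Conv2d(16, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # fully connected: classify

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

print(SimpleCNN()(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```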
