Deep Learning Algorithm Interview: Frequently Asked Questions (1)

Interview questions the blogger encountered during autumn recruitment (qiuzhao), together with related questions collected from other interviews, shared here for free~

Project description:

  1. Algorithm requirements and application scenarios
  2. Algorithm research and preliminary plan formulation
  3. Data preparation (including data annotation and data augmentation)
  4. Introduction to the algorithm (including input and output, loss, backbone, training details, etc.)
  5. The modules I added or modified myself (why change them this way? Back it up with ablation experiments)
  6. Calculation of various indicators
  7. Whether the algorithm takes the parameter count, inference speed, etc. into account

mIoU

mIoU is an important metric for measuring image-segmentation accuracy. It can be interpreted as the mean intersection-over-union: the IoU is computed for each category as TP / (TP + FN + FP) (true positives divided by true positives plus false negatives plus false positives) and then averaged over the categories.
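
A minimal NumPy sketch (assuming a precomputed confusion matrix) of how per-class IoU and mIoU can be computed:

import numpy as np

def mean_iou(conf_matrix):
    # conf_matrix[i, j]: number of pixels of true class i predicted as class j
    tp = np.diag(conf_matrix)             # true positives per class
    fp = conf_matrix.sum(axis=0) - tp     # false positives per class
    fn = conf_matrix.sum(axis=1) - tp     # false negatives per class
    iou = tp / (tp + fp + fn + 1e-10)     # per-class IoU
    return iou.mean()

conf = np.array([[50, 2], [3, 45]])
print(mean_iou(conf))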


Advantages and disadvantages of dilated convolution

Advantages: Increase the receptive field without downsampling

Disadvantages:

  1. Gridding effect. If you simply stack several 3x3 kernels that all use dilation rate 2, not every input pixel contributes to the output, i.e., the sampled positions are not contiguous. For pixel-wise prediction tasks this is fatal.
  2. Long-range information may not be relevant, so dilated convolution helps large-object segmentation but may not benefit small-object segmentation.

How to overcome the gridding effect?

  • The dilation rates of the stacked dilated convolutions must not share a common divisor greater than 1; this counteracts the gridding effect (e.g., (2, 4, 6) does not work)
  • Design the dilation rates as a sawtooth pattern, e.g., [1, 2, 5, 1, 2, 5], so that the needs of both large and small objects are met (see the sketch below)
  • The last layer's dilated convolution has the largest dilation rate, and that rate should be no larger than the kernel size; this is mainly against the gridding effect
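
A minimal PyTorch sketch (channel count and rates chosen only for illustration) of a stack of dilated convolutions with sawtooth dilation rates:

import torch
import torch.nn as nn

rates = [1, 2, 5, 1, 2, 5]   # sawtooth pattern, no common divisor > 1
layers = []
for r in rates:
    # padding = dilation keeps the spatial size unchanged for 3x3 kernels
    layers.append(nn.Conv2d(64, 64, kernel_size=3, padding=r, dilation=r))
    layers.append(nn.ReLU(inplace=True))
hdc_block = nn.Sequential(*layers)

x = torch.randn(1, 64, 32, 32)
print(hdc_block(x).shape)  # torch.Size([1, 64, 32, 32])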

Functions of the 1x1 convolution kernel

  • Increase or reduce the channel dimension by controlling the number of convolution kernels (see the sketch below)
  • Perform normalization operations on different features
  • Cross-channel information exchange
  • Increase the nonlinearity of the network
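
A minimal PyTorch sketch (channel sizes are arbitrary) of a 1x1 convolution used for channel reduction and cross-channel mixing:

import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)              # feature map with 256 channels
reduce = nn.Conv2d(256, 64, kernel_size=1)   # 1x1 conv: 256 -> 64 channels
y = torch.relu(reduce(x))                    # add nonlinearity after mixing channels
print(y.shape)                               # torch.Size([1, 64, 28, 28]); spatial size unchanged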

Transposed convolution:

When the padding is 0 and the stride is 1:

  • Pad the input with k-1 zeros on each side (k is the convolution kernel size)
  • Flip the convolution kernel vertically and horizontally
  • Then perform a normal convolution (padding 0, stride 1)


When the stride is s: insert s-1 zero rows/columns between adjacent rows and columns of the input, then proceed as above.

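A minimal PyTorch sketch (sizes chosen arbitrarily) showing how a transposed convolution enlarges the spatial size; the output size follows (H_in - 1) * stride - 2 * padding + kernel_size:

import torch
import torch.nn as nn

x = torch.randn(1, 16, 8, 8)
deconv = nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1)
print(deconv(x).shape)  # torch.Size([1, 8, 16, 16]): (8 - 1) * 2 - 2 * 1 + 4 = 16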

What is the difference between the principle of CenterNet and traditional target detection?

CenterNet is anchor-free. At comparable speed, its accuracy is several points higher than YOLOv3's. It directly predicts the center point and size of each object and does not need NMS.

Algorithm flow:

  • A 1 x 3 x 512 x 512 input is downsampled 32x by the backbone during feature extraction, giving a 1 x 2048 x 16 x 16 feature map
  • It is then upsampled to 128 x 128 by a three-layer transposed-convolution module and finally sent to three head branches for prediction
  • The prediction results are the object category (heatmap), the width/height, and the center-point offset
  • The core of inference is extracting the required bounding boxes from the heatmap. A 3x3 max pooling checks whether each hotspot value is larger than its 8 surrounding neighbors; the top 100 peaks are taken per category, and the final predicted boxes are obtained after decoding and threshold filtering (see the sketch below).
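
A minimal sketch (tensor shapes assumed, simplified) of the 3x3 max-pooling "pseudo-NMS" used to pick heatmap peaks:

import torch
import torch.nn.functional as F

def heatmap_peaks(heatmap, k=100):
    # heatmap: (B, C, H, W), values in [0, 1] after a sigmoid
    hmax = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (hmax == heatmap).float()    # keep only local maxima
    b, c, h, w = peaks.shape
    scores, idx = peaks.view(b, c, -1).topk(k)     # top-k peaks per category
    ys, xs = idx // w, idx % w                     # recover spatial coordinates
    return scores, ys, xs

scores, ys, xs = heatmap_peaks(torch.rand(1, 3, 128, 128))
print(scores.shape)  # torch.Size([1, 3, 100])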

CSP structure principle and function:

CSP (Cross Stage Partial) is a design idea, similar in spirit to ResNet and DenseNet. It splits the feature map into two parts along the channel dimension: one part goes through the convolution operations, and the other part is concatenated (cat) with the result of the first part's convolutions. It addresses three problems (a sketch follows the list):

  • Enhanced CNN learning ability, able to maintain accuracy while being lightweight
  • reduce computing cost
  • Reduce memory overhead. CSPNet improves the information flow of dense blocks and transition blocks, optimizes the gradient back-propagation path, strengthens the network's learning ability, and gains a lot in processing speed and memory.
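
A minimal PyTorch sketch (the split ratio and the inner conv path are assumptions) of the CSP idea: split the channels, convolve one half, and concatenate it with the untouched half:

import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        # only half of the channels go through the (heavier) conv path
        self.conv_path = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.fuse = nn.Conv2d(channels, channels, 1)   # transition after the concat

    def forward(self, x):
        part1, part2 = torch.chunk(x, 2, dim=1)        # split along the channel dim
        out = torch.cat([part1, self.conv_path(part2)], dim=1)
        return self.fuse(out)

x = torch.randn(1, 64, 32, 32)
print(CSPBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])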

Introducing Faster R-CNN and Cascade R-CNN

Faster R-CNN is an anchor-based, two-stage detector, divided into four parts (backbone, RPN, ROI pooling, and the classification/regression heads):

  • The backbone extracts features; the RPN then classifies anchors as positive or negative with softmax and refines them with bounding-box regression to obtain accurate candidate regions
  • The RPN's proposals are sent, together with the feature map, to ROI pooling to obtain fixed-size candidate features, which are finally fed to the classification and box-regression heads to produce the final prediction

Cascade R-CNN improves on Faster R-CNN: several detection heads are cascaded so that the predictions are refined step by step. In short, Cascade R-CNN is a series of detection models, each trained with positive and negative samples defined by a different IoU threshold; the output of the previous model is the input of the next, so it is a stage-by-stage training scheme, and the IoU threshold that defines positives keeps rising for later stages.

Why does linear regression use mse as a loss function?

Linear regression assumes the noise follows a normal (Gaussian) distribution; when the noise is Gaussian, the dependent variable y given x is also Gaussian. Maximizing the likelihood under this assumption is equivalent to minimizing the squared error, so using MSE implicitly assumes that y is normally distributed (see the derivation sketch below).
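
A short maximum-likelihood sketch in LaTeX (standard derivation, stated here for reference):

y = w^\top x + \epsilon,\quad \epsilon \sim \mathcal{N}(0,\sigma^2)
\Rightarrow\; p(y \mid x; w) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y - w^\top x)^2}{2\sigma^2}\right)
\log L(w) = \sum_{i=1}^{N}\log p(y_i \mid x_i; w) = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - w^\top x_i)^2 + \text{const}
\Rightarrow\; \arg\max_w \log L(w) = \arg\min_w \sum_{i=1}^{N}(y_i - w^\top x_i)^2 \quad(\text{i.e., MSE})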

optimization:

  1. Basic gradient descent methods, including SGD, etc.
  2. Momentum optimization methods, including momentum, NAG, etc.
  3. Adaptive learning rate optimization methods, including Adam, AdaGrad, RMSProp, etc.

The principle and common methods of upsampling:

In a CNN, the feature maps shrink as the input image passes through the network. Sometimes we need to restore them to the original resolution for further computation (e.g., image segmentation); the operation of going from a low resolution to a high resolution is upsampling.

  • Interpolation: bilinear interpolation is generally used and works best (its computation is more involved), but compared with convolutions the cost is negligible
  • Transposed convolution: zeros are inserted between the elements of the input feature map and a standard convolution is then applied, so the output feature map is larger than the input
  • Max unpooling: record the indices of the maxima at the corresponding max-pooling layer, put the values back at those positions, and fill the remaining positions with 0 (see the sketch below)
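
A minimal PyTorch sketch (sizes arbitrary) of two of these upsampling options:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 8, 16, 16)

# 1) Bilinear interpolation to twice the resolution
up1 = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
print(up1.shape)   # torch.Size([1, 8, 32, 32])

# 2) Max unpooling: pool while recording indices, then put the maxima back
pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)
y, idx = pool(x)
up2 = unpool(y, idx)   # zeros everywhere except at the recorded max positions
print(up2.shape)   # torch.Size([1, 8, 16, 16])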

Parameter amount and FLOPs calculation

Parameter count, for a standard convolution layer with kernel size K x K, C_in input channels and C_out output channels:

Params = K x K x C_in x C_out (+ C_out if a bias is used)

FLOPs, for an output feature map of size H_out x W_out:

FLOPs = K x K x C_in x C_out x H_out x W_out multiply-accumulate operations (multiply by 2 if multiplications and additions are counted separately)
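
A minimal Python sketch (bias included, groups and other layer types ignored) applying these formulas to one convolution layer:

def conv_params_and_flops(k, c_in, c_out, h_out, w_out, bias=True):
    params = k * k * c_in * c_out + (c_out if bias else 0)
    # one multiply-accumulate per kernel element per output position
    flops = k * k * c_in * c_out * h_out * w_out
    return params, flops

# e.g. a 3x3 convolution, 64 -> 128 channels, 56x56 output map
print(conv_params_and_flops(3, 64, 128, 56, 56))  # (73856, 231211008)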

Add and concat operations in CNN:

The add and concat branch-merging operations are collectively called shortcuts. Add was introduced in ResNet: feature maps are summed element-wise, which enriches the information carried by each feature without changing the number of channels. Concat was first used by Inception and carried forward by DenseNet: it keeps the original features and increases the number of feature channels, so effective information keeps flowing backward through the network.
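
A minimal PyTorch sketch showing the shape difference between the two operations:

import torch

a = torch.randn(1, 64, 28, 28)
b = torch.randn(1, 64, 28, 28)

added = a + b                       # element-wise add: channels stay at 64
concat = torch.cat([a, b], dim=1)   # concat along channels: 64 + 64 = 128
print(added.shape, concat.shape)    # torch.Size([1, 64, 28, 28]) torch.Size([1, 128, 28, 28])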

Why do neural networks often use ReLU as the activation function?

  • It is cheap to compute and avoids gradient vanishing (its derivative is 1 in the positive region)
  • ReLU sets the output of some neurons to 0, which makes the network sparse, reduces the interdependence between parameters of adjacent layers, and improves the generalization of the model

Why does BN need the scale-and-shift (reconstruction) step?

Normalization alone may destroy the feature distribution that a layer has learned, so the learnable parameters γ and β are introduced, allowing the network to recover the feature distribution that the original network needed to learn (for example, when the data distribution is unbalanced).

The difference between BN training and testing

  • Training phase: compute the mean and variance of the current batch (each training step sees one batch), normalize with them, then scale and shift
  • Test phase: only one image may be input at a time, so a batch mean and variance cannot be computed. running_mean and running_var are accumulated during training and used directly at test time, with no extra computation.

Differences and improvements of Inception V1 - V4

V1: GoogLeNet replaces some large convolution kernels with smaller convolutions

V2: BN is added to the layer inputs, so training converges faster and the reliance on dropout can be reduced

V3: some 7x7 convolutions in GoogLeNet are factorized into a 1x7 followed by a 7x1 convolution, and likewise for 3x3. This speeds up computation, increases the nonlinearity of the network, and reduces the chance of overfitting.

V4: the original Inception modules combined with ResNet-style residual connections

Why is the cross-entropy loss function more commonly used than the mean square error loss function in classification tasks?

The gradient of the cross-entropy loss with respect to the weights does not contain the derivative of the activation function, while the gradient of the mean-squared-error loss does (when the prediction is close to 1 or 0 that derivative is close to 0). Since the commonly used sigmoid/tanh activations have saturation regions, the MSE gradient with respect to the weights becomes very small and learning slows down; see the comparison below.
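
A short comparison sketch for a sigmoid output a = σ(z), z = w^T x + b, and target y (standard result, in LaTeX):

\text{MSE: } L = \tfrac{1}{2}(a - y)^2,\qquad \frac{\partial L}{\partial w} = (a - y)\,\sigma'(z)\,x
\text{Cross-entropy: } L = -\left[y\log a + (1-y)\log(1-a)\right],\qquad \frac{\partial L}{\partial w} = (a - y)\,x

The σ'(z) factor in the MSE gradient vanishes in the saturation regions of the sigmoid, which is exactly why cross-entropy trains faster for classification.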

Generalization error (overfitting)

Generalization Error = Variance + Bias + Noise

  1. Noise sets the ceiling for model training, i.e., it is the lower bound of the achievable error, and it cannot be avoided
  2. Variance measures how stable the model's predictions are across different samples (a large variance means overfitting and unstable predictions)
  3. Bias measures how well the model fits the training data (a large bias means underfitting and poor predictions)

Ways to reduce variance (overfitting):

data:

  1. Increase the amount of data and perform data enhancement
  2. Perform data cleaning and feature selection to reduce feature dimensions
  3. class balance

Network structure:

  1. Regularization: L1, L2, BN

  2. dropout

Macro:

  1. Select a model of appropriate complexity, or prune an existing model
  2. label smoothing
  3. earlystopping
  4. increase the learning rate
  5. by cross-validation

Ways to reduce bias (underfitting):

Network structure:

  1. Remove or weaken existing regularization constraints

Macro:

  1. Increase model complexity
  2. Ensemble learning
  3. Increase the number of epochs

Why can dropout reduce overfitting?

  1. Ensemble effect: Different neurons are randomly turned off each time during the training process, and the network structure has changed. The entire dropout training process is equivalent to taking the average of many different networks to achieve the ensemble effect.
  2. Reduce complex co-adaptation between neurons: with dropout, two given neurons do not necessarily appear in the network together every time, which reduces the dependence between neurons. It prevents some features from being effective only in the presence of other specific features, forcing the network not to rely on special co-occurrences but to learn more robust features.

KMeans

The steps of the algorithm (see the sketch after this list):

  1. Select K objects in the data space as initial centers, and each object represents a cluster center
  2. Traverse all data and divide each data to the nearest center point
  3. Calculate the mean value of each cluster and use it as the new center point
  4. Repeat 2-3 until k center points no longer change
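
A minimal NumPy sketch of these steps (random initialization, fixed iteration cap, empty clusters not handled; not a library implementation):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]    # step 1: pick K initial centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # step 2: assign every point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # step 3: recompute each center as the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):             # step 4: stop when centers settle
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, k=2)
print(centers)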

advantage:

  1. The principle is simple, the implementation is easy, and the convergence speed is fast
  2. Algorithms are highly interpretable
  3. The main tuning parameter is just k

shortcoming:

  1. The selection of k is not easy to grasp
  2. The initialized cluster center is sensitive, and different selection methods will get different results
  3. If the clusters are unbalanced, e.g., the amounts of data differ greatly or the categories have different variances, the clustering result is poor
  4. It is an iterative method, so it can only reach a local optimum
  5. Sensitive to noise and outliers

Focal loss

In one-stage detectors the positive and negative samples are severely imbalanced, whereas two-stage detectors use the RPN to keep the positive/negative ratio at roughly 1:3.

principle:

  1. Solve the problem of unbalanced positive and negative samples: add weight control and increase the weight of minority samples

CE(p_t) = -α_t · log(p_t), where α_t gives a larger weight to the rare (positive) class

  2. Solve the problem of easy vs. hard samples: the larger p_t is, the easier the sample is to classify, so the weight of easy samples should be reduced. A modulating factor is added so that samples with high probability get a small weight coefficient; for controllability the exponent λ (usually written γ) is introduced

FL(p_t) = -α_t · (1 - p_t)^λ · log(p_t)

Combined: the two parameters α and λ are tuned together; the authors report the best results with α = 0.25 and λ = 2.
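
A minimal PyTorch sketch of a binary focal loss written directly from these formulas (α and λ as above; not taken from any particular repo):

import torch

def focal_loss(logits, targets, alpha=0.25, lam=2.0):
    # targets: tensor of 0/1 labels with the same shape as logits
    p = torch.sigmoid(logits)
    pt = torch.where(targets == 1, p, 1 - p)   # probability assigned to the true class
    alpha_t = torch.where(targets == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))
    loss = -alpha_t * (1 - pt) ** lam * torch.log(pt.clamp(min=1e-8))
    return loss.mean()

logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
print(focal_loss(logits, targets))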

Regularization

Principle: add rules (constraints) to the loss function to shrink the solution space, thereby reducing the chance of finding an overfitted solution

Commonly used regularization methods:

  1. Data augmentation
  2. L1/L2 norm constraints
  3. Dropout
  4. Early stopping
  5. Adversarial training

The advantages and disadvantages of softmax and sigmoid in multi-classification tasks

Both multiple sigmoids and a single softmax can perform multi-class classification. If the classes are mutually exclusive (a sample can belong to only one of them), use softmax; if the classes are not mutually exclusive (multi-label), use multiple sigmoids.


How pooling backpropagates

max pooling: the gradient from the next layer is passed intact to the neuron that held the maximum value in the previous layer; the gradients at the other positions are 0

average pooling: the gradient from the next layer is distributed evenly over all neurons of the corresponding pooling window in the previous layer

pooling function:

  1. Increase the receptive field
  2. Translation invariance
  3. Reduce the difficulty of optimization

Disadvantages: sparse gradients and loss of information

AUC, ROC, mAP, Recall, Precision, F1-score

Recall: R = TP / (TP + FN); Precision: P = TP / (TP + FP)

ROC: commonly used to evaluate the quality of binary classifiers

AUC: defined as the area under the ROC curve; for a useful classifier it typically lies between 0.5 (random guessing) and 1. AUC calculation formula: https://blog.csdn.net/ustbbsy/article/details/107025087

F1-score: the harmonic mean of precision and recall. The two are in tension: a larger recall usually means more predicted boxes and therefore a smaller precision, and vice versa. F1 is used to balance precision and recall, and its value depends on both.

F1 = 2 · P · R / (P + R)
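
A tiny Python sketch computing these metrics from raw counts (the numbers are made up):

def prf1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(prf1(tp=80, fp=20, fn=40))  # (0.8, 0.666..., 0.727...)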

The principle of pytorch multi-GPU training mechanism

PyTorch's multi-GPU API is torch.nn.DataParallel(module, device_ids), where module is the model to run and device_ids is the list of GPU ids to parallelize over.

The parallel mechanism first loads the model onto the master GPU and then copies it to each of the other specified GPUs. The input data is split along the batch dimension, so each GPU receives batch_size / num_gpus samples. Each GPU runs the forward pass independently on its own slice; the losses from all GPUs are then gathered and summed, backpropagation updates the model parameters on the single master GPU, and the updated parameters are copied back to the other GPUs, which completes one iteration.
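
A minimal usage sketch (the model and data here are placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    # replicate the model onto GPUs 0 and 1; each batch is split between them
    model = nn.DataParallel(model, device_ids=[0, 1]).cuda()

x = torch.randn(32, 128)   # with 2 GPUs, each replica receives 16 samples
if torch.cuda.is_available():
    x = x.cuda()
out = model(x)             # outputs are gathered back on the master GPU
print(out.shape)           # torch.Size([32, 10])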

bias and variance

Model Error = Variance + Bias + Unavoidable Error

Bias: The difference between the predicted value and the true value. The greater the deviation, the more it deviates from the real data

Variance: Describes the variation range and degree of dispersion of the predicted value. The larger the variance, the more dispersed the data distribution.

KL divergence

KL divergence, also called relative entropy, is used to measure the degree of difference between two probability distributions.

D_KL(P || Q) = Σ_x p(x) · log( p(x) / q(x) )

KL divergence is also called relative entropy: KL divergence = cross entropy − information entropy, i.e., D_KL(P || Q) = H(P, Q) − H(P).

What types of features are extracted at different levels of convolution?

  1. Shallow convolution to extract edge features
  2. The middle layer convolution extracts local features
  3. Deep convolution extracts global features

High-frequency BN interview questions

1. What problem does BN solve?

Before the nonlinear transformation, the distribution of a layer's activation inputs gradually shifts as the network gets deeper (internal covariate shift). Training converges slowly because the overall distribution gradually moves toward the saturating ends of the nonlinear function, which makes the gradients of the lower layers vanish during backpropagation. BN uses a normalization step to forcibly pull the input distribution of every neuron in each layer back to a standard normal distribution with mean 0 and variance 1.

2. BN formula

μ_B = (1/m) · Σ_i x_i,  σ_B² = (1/m) · Σ_i (x_i − μ_B)²,  x̂_i = (x_i − μ_B) / sqrt(σ_B² + ε),  y_i = γ · x̂_i + β

where γ and β are the scale factor and offset. Simply subtracting the mean and dividing by the standard deviation is not necessarily the best distribution for the next layer, so two learnable variables are added to adjust the distribution and achieve a better result.
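
A minimal NumPy sketch of the training-time forward pass of BN for a fully connected layer (per-feature statistics; ε, γ, β as above):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, features); statistics are computed per feature over the batch
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta             # learnable scale and shift

x = np.random.randn(32, 4) * 3 + 7
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per feature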

3. The difference between BN layer training and testing

Training phase: BN layer standardizes the training data of each batch, that is, uses the mean and variance of each batch of data

Test phase: only a single test sample may be input, so the mean and variance used are those of the whole training set, obtained with a moving (running) average during training. Roughly, the mean is the average of the batch means, and the standard deviation uses an unbiased estimate based on the batch variances.

4. Why not use the mean and variance of the entire training set during BN training?

Using the mean and variance of the entire training set makes it easier to overfit. The point of BN is to normalize each batch to the same distribution; the small differences in mean and variance between batches act like noise, which increases the robustness of the model and also reduces overfitting to some extent.

5. The number of parameters in BN

γ and β are learnable parameters. In a CNN, if a layer has M feature maps (channels), the number of BN parameters for that layer is 2M (one γ and one β per channel).

6. Advantages and disadvantages of BN:

advantage:

  1. A larger initial learning rate can be chosen, because the distribution of each layer is kept fixed and the lower layers are no longer hard to train
  2. Dropout and L2 regularization become less necessary: BN ties all samples in a mini-batch together, so the network does not produce a deterministic result for any single training sample
  3. Datasets can be completely scrambled
  4. The model is more robust
  5. The model trains faster and reduces vanishing gradients
  6. Reduce parameter initialization sensitivity

shortcoming:

  1. BN depends heavily on the batch size; when the batch is small, the computed mean and variance are unstable
  2. (To be updated)

NMS process

  1. First set two values: a score threshold and an IOU threshold
  2. Filter out the candidate boxes whose scores are below the score threshold, and sort the rest by classification probability: A < B < C < D < E < F
  3. First mark the highest-probability box F as a box to keep
  4. Next, check whether the IoU of each of A–E with F exceeds the IoU threshold; suppose B and D overlap F beyond the threshold, so remove B and D
  5. From the remaining A, C, E, pick the highest-probability E, mark it as a box to keep, then check the overlap of E with A and C and remove any box whose overlap exceeds the threshold
  6. Repeat steps 3–5 until no boxes are left (see the sketch below).
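
A minimal NumPy sketch of this procedure (boxes as [x1, y1, x2, y2]; the thresholds are illustrative):

import numpy as np

def nms(boxes, scores, iou_thr=0.5, score_thr=0.05):
    keep_mask = scores > score_thr                    # steps 1-2: drop low-score boxes
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = scores.argsort()[::-1]                    # sort by score, high to low
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)                                # step 3: keep the current best box
        if order.size == 1:
            break
        # step 4: IoU between the kept box and all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_rest - inter)
        order = order[1:][iou <= iou_thr]             # step 5: drop boxes that overlap too much
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: box 1 is suppressed by box 0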

Soft NMS code and implementation

Disadvantages of traditional NMS:

  • When two target boxes are close together, the box with the lower score is deleted simply because the overlap is too large
  • The threshold of NMS is not easy to determine. If it is set too small, it will miss detection. If it is set too large, it will easily increase false detection.

Advantages of Soft NMS algorithm:

  • It can be easily introduced into object detection without retraining the original model
  • Soft NMS can keep using traditional NMS during training; it only needs to be implemented in the inference code

Idea: Do not directly delete all boxes with IOU greater than the threshold, but reduce the confidence.

(1) Linear weighting

s_i = s_i, if IoU(M, b_i) < N_t;  s_i = s_i · (1 − IoU(M, b_i)), if IoU(M, b_i) ≥ N_t, where M is the currently kept box and N_t the IoU threshold

(2) Gaussian weighting

s_i = s_i · exp( − IoU(M, b_i)² / σ )

Both improvements lower the confidence of overlapping boxes instead of deleting them outright, so they can still be removed by the score threshold in the first stage; the relationship between IoU and confidence is taken into account.

Which methods can improve the performance of small target detection

  1. Increase image resolution
  2. Increase the input resolution of the model
  3. tile image
  4. data augmentation
  5. Automatically learn anchor
  6. category optimization

Why does adding the residual module have an effect?

Hypothetical example: suppose the desired output of a block is 5 and the current output is 5.1. Without a residual module, the block has to learn the full mapping whose output should move from 5.1 to 5; the relative change is tiny, so it is hard to learn.

With a residual design, H(x) = F(x) + 5 = 5.1, so F(x) = 0.1, and the learning target becomes pushing 0.1 toward 0, which is much simpler. In other words, the mapping after introducing the residual module is more sensitive to changes in the output.

Looking further: moving the overall mapping from 5.1 to 5 is a relative change of (5.1 − 5)/5.1 ≈ 2%, whereas moving the residual branch from 0.1 to 0 is a 100% relative change, so the residual branch receives a much stronger learning signal. With few layers a plain network can cope, but once the network gets deep this difference matters a lot.

Disadvantages of ResNet

The layers of ResNet that really do useful work are mostly in the middle; the very deep layers contribute relatively little (in a deep network the deep layers mainly learn identity mappings, which makes ResNet behave like an ensemble of many shallower networks).

The structure and characteristics of the MobileNet series of models?

  • The v1 network takes a VGG-like structure and replaces standard convolutions with depthwise convolution (spatial information) plus pointwise convolution (channel information), which greatly reduces the number of parameters without losing much accuracy.

  • The v2 network uses a linear bottleneck: the nonlinear activation at the end of the bottleneck is replaced by a linear transformation, which experiments show preserves useful feature information better in small networks (ReLU zeroes out a lot of information in low-dimensional features). It also uses inverted residuals, the opposite of ResNet's channel pattern: since the linear-bottleneck features are low-dimensional, the block first expands to a high-dimensional feature map, applies the depthwise convolution there, and then projects back down.

  • V3 has two main innovations: 1. resource-constrained NAS performs module-level search, and NetAdapt performs a local search to fine-tune each network layer after the modules are determined. 2. Network structure improvements: the number of layers is further reduced, and the h-swish activation function is introduced, because the authors found that swish effectively improves accuracy but is too expensive to compute.


The structure and characteristics of the VIT model?

**Features:**

  1. ViT applies a standard Transformer directly to image classification; its model structure contains no CNN.
  2. To satisfy the Transformer's input format, the image is split into patches that are fed into the network as a sequence; at the output, the class token is used for the classification prediction.
  3. After pre-training on a large-scale dataset and then transfer learning, ViT can reach SOTA performance on specific tasks.

The model can be divided into the following parts: (1) patch embedding, (2) multi-head attention, (3) multi-layer perceptron (MLP), (4) DropPath, class token, and positional encoding.

Structure and features of EfficientNet series?

EfficientNet is obtained through a network search that jointly scales three dimensions: depth, width, and input image resolution. The three dimensions are not independent: a higher input resolution requires a deeper network to obtain a large enough receptive field, and more channels to capture finer-grained features.

Internally, EfficientNet is built from stacked MBConv modules. Experiments show that depthwise convolution is still very effective in large models, since it provides strong feature extraction and expressiveness at low cost. DropConnect is used instead of traditional dropout to prevent overfitting; the difference is that during training DropConnect randomly drops the inputs of hidden-layer nodes rather than their outputs.


How to deal with unbalanced data categories?

  • data augmentation
  • Oversampling the minority class and undersampling the majority class
  • Weight balance of loss function (focal loss)
  • Collect data from minority classes
  • Threshold adjustment: change the default decision threshold of 0.5 to (number of minority samples) / (number of minority samples + number of majority samples)

Sorting Algorithm

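A minimal quicksort sketch in Python as a representative example (average O(n log n), worst case O(n^2); this simple version is not in-place):

def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]      # elements smaller than the pivot
    mid = [x for x in arr if x == pivot]      # elements equal to the pivot
    right = [x for x in arr if x > pivot]     # elements larger than the pivot
    return quicksort(left) + mid + quicksort(right)

print(quicksort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]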

python decorator

A decorator adds extra functionality to an existing function without modifying it: the existing function is passed to the decorator, which returns a wrapper that performs the original function's work plus the added functionality.

import logging

def use_log(func):
    def wrapper(*args, **kwargs):
        # extra behaviour added by the decorator
        logging.warning('%s is running' % func.__name__)
        return func(*args, **kwargs)
    return wrapper

@use_log
def bar():
    print('I am bar')

bar()

------------ The output is as follows ------------
WARNING:root:bar is running
I am bar

Python's deep copy and shallow copy

In Python, assigning one variable to another does not copy the object; it just adds another reference to the same object in memory.

  • Shallow copy: Create a new object whose content is a reference to the elements in the original object (the new object shares the sub-objects in the memory with the original object), such as slice operations, factory functions, object copy methods, and copy functions in the copy module.
  • Deep copy: Create a new object, and then recursively copy the sub-objects contained in the original object. The deep copied object has no relationship with the original object, the deepcopy function in the copy module

**Note:** The difference between shallow copy and deep copy only matters for compound objects, i.e., objects that contain other objects, such as lists and class instances. For atomic types such as numbers and strings there is no real copying; you simply get another reference to the original object.
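
A minimal sketch illustrating the difference:

import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)    # new outer list, but the inner lists are shared
deep = copy.deepcopy(original)   # the inner lists are recursively copied as well

original[0].append(99)
print(shallow[0])  # [1, 2, 99] -> affected, the inner list is shared
print(deep[0])     # [1, 2]     -> unaffected, fully independent copy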

Is python an interpreted or compiled language?

Python is generally regarded as an interpreted language: the source code is first compiled to bytecode, which the Python virtual machine then interprets and executes.

The advantage of an interpreted language is good portability; the disadvantages are that it needs an interpreter environment to run, it runs slower than a compiled language, it uses somewhat more resources, and its code efficiency is lower.

The advantages of a compiled language are fast execution and high code efficiency, and the compiled program cannot easily be modified, which gives better confidentiality. The disadvantages are that the code must be compiled before it can run and portability is poor: it only runs on compatible operating systems.

Python's garbage collection mechanism

Python uses reference counting for garbage collection, a mark-and-sweep algorithm to handle the reference cycles that container objects may create, and generational collection to improve the efficiency of garbage collection.

  • Reference counting: when a variable holds a reference to an object, the object's reference count increases by 1; when del removes the reference, the count decreases by 1; when the count reaches 0, the object is deleted.
  1. Main method: reference counting. Each object internally maintains a count of how many references point to it; when the count drops to 0, Python's garbage collector reclaims the object automatically
  2. Auxiliary method: mark-and-sweep. When the difference between the number of allocated objects and the number of released objects exceeds a threshold, Python's collector is triggered (and generational collection decides how often each generation is scanned)
  3. Auxiliary method: when gc.collect() is called in the code, the interpreter performs garbage collection immediately

The difference between range and xrange in python

xrange and range are used in exactly the same way; the difference is that xrange does not build a list object but returns a lazy (generator-like) sequence. When producing a large range of numbers, xrange performs much better than range because it does not allocate a large block of memory up front. In Python 3 the xrange function was removed and only range remains, but Python 3's range combines the two behaviors: in Python 2 range returned a list, while in Python 3 it returns a range object.


Origin blog.csdn.net/weixin_45074568/article/details/129068787