Common Interview Questions: Deep Learning (continuously updated)

1. What is the function of pooling, why use pooling, and the types of pooling?

The main function of pooling is to retain the main features while keeping a degree of invariance to rotation and translation. The two main types are max pooling and average pooling. Roughly speaking, max pooling highlights salient foreground/texture features, while average pooling preserves overall background information.

2.  How does the pooling layer perform backpropagation and gradient update?

For max pooling, only the position that held the maximum value receives the gradient; all other positions in the window get 0. For average pooling, every element in the original pooling window receives a gradient equal to the gradient at the corresponding position of the pooled output divided by the number of elements in the window (e.g. 4 for 2x2 pooling).

Mean pooling

The forward pass of mean pooling averages the values in a patch. The backward pass therefore splits the gradient of each pooled output element into n equal parts (n being the patch size) and distributes them back to the corresponding positions in the previous layer, which guarantees that the sum of the gradients (residuals) is the same before and after pooling.

Max pooling

Max pooling must also satisfy the principle that the sum of the gradients is preserved. Its forward pass passes only the largest value in each patch to the next layer and discards the other values. In the backward pass, the gradient is therefore routed entirely to the single position in the previous layer that produced the maximum, while all other positions receive a gradient of 0. The difference from mean pooling is that max pooling must record which position held the maximum during the forward pass (often called max_id), because this index is needed during backpropagation.
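
As a rough illustration (a minimal NumPy sketch over a single 2x2 window, not an optimized implementation), the two backward passes described above can be written as:

```python
import numpy as np

def pool_backward_2x2(window, grad_out, mode="max"):
    """Backward pass of one 2x2 pooling window.

    window:   the 2x2 input patch seen in the forward pass
    grad_out: the scalar gradient flowing into the pooled output
    Returns the 2x2 gradient w.r.t. the input patch.
    """
    grad_in = np.zeros_like(window, dtype=float)
    if mode == "max":
        # route the whole gradient to the position of the maximum (max_id)
        idx = np.unravel_index(np.argmax(window), window.shape)
        grad_in[idx] = grad_out
    else:  # mean pooling
        # split the gradient equally among the window elements
        grad_in[:] = grad_out / window.size
    return grad_in

x = np.array([[1.0, 3.0],
              [2.0, 0.5]])
print(pool_backward_2x2(x, grad_out=1.0, mode="max"))   # gradient only at the max
print(pool_backward_2x2(x, grad_out=1.0, mode="mean"))  # 0.25 everywhere
```

In both cases the gradient sums to grad_out, which is exactly the "sum of gradients is preserved" rule.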

3. What are the disadvantages of max pooling and mean pooling?

Pooling enlarges the receptive field so that subsequent convolutions see more context, but the downsampling also discards information (only what the pooling operation considers important is kept). In other words, the larger receptive field comes at the cost of lower resolution, which hurts tasks such as segmentation that require precise localization.

4. Backpropagation of convolution: https://blog.csdn.net/legend_hua/article/details/81590979

How does CNN parallelism work? How are model parallelism and data parallelism done on CNN? https://blog.csdn.net/xsc_c/article/details/42420167

5. What is the role of 1x1 convolution in the convolutional layer?

    1) Realize cross-channel interaction and information integration.

    2) Perform dimensionality reduction and dimensionality increase of the number of convolution kernel channels.
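
For illustration (a PyTorch sketch; the channel counts and input size are made up), a 1x1 convolution that mixes information across channels and compresses 256 channels down to 64 while keeping the spatial resolution unchanged looks like:

```python
import torch
import torch.nn as nn

# 1x1 convolution: cross-channel interaction plus channel dimensionality reduction
reduce_channels = nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1)

x = torch.randn(8, 256, 32, 32)   # (batch, channels, height, width)
y = reduce_channels(x)
print(y.shape)                     # torch.Size([8, 64, 32, 32])
```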

6. The causes of vanishing and exploding gradients. Why does an RNN suffer from vanishing gradients, and how does LSTM alleviate the vanishing-gradient problem of RNNs?

https://zhuanlan.zhihu.com/p/28687529

https://weizhixiaoyi.com/archives/491.html

7. The structure and principle of LSTM, GRU, etc.; how to calculate the number of parameters of an LSTM and a GRU
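
For the parameter count, a minimal sketch using the standard single-layer formulations with one bias vector per gate (note that some frameworks, e.g. PyTorch, keep two separate bias vectors per gate, which adds an extra hidden_size per gate):

```python
def lstm_params(input_size, hidden_size):
    # 4 gates (input, forget, candidate cell, output), each with a weight
    # matrix over [x_t, h_{t-1}] plus a bias vector.
    return 4 * ((input_size + hidden_size) * hidden_size + hidden_size)

def gru_params(input_size, hidden_size):
    # 3 gates (reset, update, candidate), same structure.
    return 3 * ((input_size + hidden_size) * hidden_size + hidden_size)

print(lstm_params(128, 256))  # 4 * (384*256 + 256) = 394240
print(gru_params(128, 256))   # 3 * (384*256 + 256) = 295680
```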

Methods to deal with overfitting and underfitting. Overfitting: 1. Get more / augment the training data 2. Reduce model complexity 3. Add regularization terms 4. Use bagging

Underfitting: 1. Add new features 2. Increase model complexity 3. Reduce the regularization coefficient

8. How to choose the activation function?

Choosing a suitable activation function is not easy, and many factors need to be considered. The usual practice is: if you are not sure which activation function works best, try several, evaluate them on the validation (or test) set, and keep the one that performs best.

The following are common choices:

1. If the output is a 0/1 value (a binary classification problem), use the sigmoid function for the output layer and the ReLU function for all the other units.

2. If you are not sure which activation function to use in the hidden layers, ReLU is the usual choice; the tanh activation function is sometimes used as well. One property of ReLU is that its derivative is 0 for negative inputs (a subgradient is used at 0). Its drawback is that a large gradient during training can push a unit permanently into the negative region, so its gradient stays zero from then on.

3. Sigmoid activation function: rarely used, except in the output layer of a binary classification problem.

4. Tanh activation function: tanh works well and is suitable for almost all situations.

5. ReLU activation function: the most common default. If you are not sure which activation function to use, start with ReLU or Leaky ReLU, then try other activation functions.

6. If you run into dead neurons, you can switch to the Leaky ReLU function.

What does the activation function do? It introduces non-linearity so that the network can represent non-linear mappings.
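
For concreteness, a minimal NumPy sketch of the activations mentioned in the list above (the 0.01 slope for Leaky ReLU is just a common default):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # saturates for large |x| -> tiny gradients

def tanh(x):
    return np.tanh(x)                      # zero-centered, but also saturates

def relu(x):
    return np.maximum(0.0, x)              # derivative 1 for x > 0, 0 for x < 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small slope for x < 0 avoids dead units
```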

The advantages and disadvantages of the activation function: https://zhuanlan.zhihu.com/p/92412922

Most activation functions are monotonic. Why is this?

https://blog.csdn.net/junjun150013652/article/details/81487059

9. What are the common loss functions?

1. 0-1 loss

2. Absolute value (L1) loss

3. Squared error loss

4. Exponential loss

5. Log loss

6. Cross-entropy loss (being able to derive it is the most important part)
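
For reference, the usual forms of these losses for a label y and prediction f(x) (or predicted probability p; conventions vary slightly across textbooks):

```latex
\begin{aligned}
&\text{0-1 loss:} && L(y, f(x)) = \mathbb{1}\,[\,y \neq f(x)\,] \\
&\text{Absolute loss:} && L(y, f(x)) = |y - f(x)| \\
&\text{Squared loss:} && L(y, f(x)) = (y - f(x))^2 \\
&\text{Exponential loss:} && L(y, f(x)) = e^{-y f(x)}, \quad y \in \{-1, +1\} \\
&\text{Log loss:} && L = -\log p(y \mid x) \\
&\text{Cross-entropy:} && L = -\textstyle\sum_k y_k \log p_k
\end{aligned}
```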

10. The difference between the cross entropy loss function and the mean square error loss function. Why is the cross entropy loss function more commonly used than the mean square error loss function in classification?

The gradient of the cross-entropy loss with respect to the weights is proportional to the error between the predicted value and the true value and does not contain the derivative of the activation function, whereas the gradient of the mean squared error loss does contain it. Since commonly used activations such as sigmoid/tanh have gradient-saturation regions, the MSE gradient with respect to the weights can become very small, the parameters w adjust slowly, and training is slow. The cross-entropy loss does not have this problem: the parameters are adjusted in proportion to the error, so training is faster and usually works better.
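
A one-neuron sigmoid example makes this concrete. With z = wx + b and a = σ(z):

```latex
\begin{aligned}
\text{MSE: } & L = \tfrac{1}{2}(a-y)^2
  &&\Rightarrow\quad \frac{\partial L}{\partial w} = (a-y)\,\sigma'(z)\,x \\
\text{Cross-entropy: } & L = -\big[y\ln a + (1-y)\ln(1-a)\big]
  &&\Rightarrow\quad \frac{\partial L}{\partial w} = (a-y)\,x
\end{aligned}
```

Since σ'(z) = a(1 − a) ≤ 0.25 and is close to 0 in the saturated regions, the MSE gradient can vanish, while the cross-entropy gradient depends only on the error a − y.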

For details, please refer to: https://www.jianshu.com/p/d20e293a0d34

11. Why can Dropout prevent overfitting? What is the principle of the algorithm? What are its disadvantages? https://blog.csdn.net/program_developer/article/details/80737724
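
A minimal sketch of (inverted) dropout at training time, assuming a keep probability of 0.8; at test time the layer is just an identity because of the 1/keep_prob rescaling:

```python
import numpy as np

def dropout_forward(x, keep_prob=0.8, training=True):
    if not training:
        return x                                   # no-op at inference time
    mask = (np.random.rand(*x.shape) < keep_prob)  # randomly keep each unit
    return x * mask / keep_prob                    # rescale so the expected output equals x
```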

12. The implementation of BN; how does the BN algorithm help prevent overfitting? How does the BN algorithm speed up network training?

BN formula
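
For a mini-batch {x_1, ..., x_m}, with learned scale γ and shift β and a small ε for numerical stability, the standard transform is:

```latex
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i,\qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_B)^2,\qquad
\hat{x}_i = \frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}},\qquad
y_i = \gamma\,\hat{x}_i + \beta
```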

During training, BN ties all the samples in a mini-batch together, so the network does not produce a deterministic output for any single training sample. In other words, with BN each update and each output depends on the statistics of the whole mini-batch rather than on one sample alone, so the parameters do not become overly dependent on any individual example. This acts as a mild regularizer and prevents overfitting to some extent.

BN pulls the distribution of the data after each convolution back towards a specified distribution (e.g. a standard Gaussian). Because the input distribution to each layer then stays roughly the same, learning becomes much faster; in particular, for sigmoid-like functions the data is pulled back into the non-saturated region, which avoids vanishing gradients and speeds up training. https://hellozhaozheng.github.io/z_post/%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0-Batch-Normalization%E6%B7%B1%E5%85%A5%E8%A7%A3%E6%9E%90/
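
A minimal training-time sketch of the BN computation over the batch axis (running statistics and the backward pass are omitted):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Returns the normalized, scaled and shifted output."""
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to roughly N(0, 1)
    return gamma * x_hat + beta             # learnable scale and shift
```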

13. The role and difference between LN and BN   https://zhuanlan.zhihu.com/p/74516930

14. Where is BN generally placed in a network? BN is inserted as a layer of its own, just like convolutional, pooling, and activation layers: the output of one layer becomes its input. The BN layer is usually added before the activation function so that the activation's input is normalized, which removes the shift and scale drift of the input distribution.
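
For illustration (a PyTorch sketch with arbitrary channel sizes), the usual Conv → BN → activation ordering:

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),  # bias is redundant before BN
    nn.BatchNorm2d(128),    # normalize the conv output
    nn.ReLU(inplace=True),  # the activation sees normalized inputs
)
```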

15. Where is dropout generally used in a model? Mostly in the fully connected layers; it can also be applied to convolutional layers, but this is less common.

16. The difference between L1 regular and L2 regular (analyzed from different angles) https://blog.csdn.net/Matrix_cc/article/details/115270671

How is the L1-regularized objective optimized (the L1 term is not differentiable at 0)? One option is coordinate descent: in each iteration, perform a one-dimensional search along a single coordinate direction at the current point, keeping all other coordinates fixed, and find the minimum along that direction. Gradient descent, by contrast, always moves along the negative gradient direction.
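
A rough sketch of coordinate descent for the lasso objective 0.5·||y − Xw||² + λ·||w||₁ (the soft-thresholding step follows from the L1 subgradient condition; column scaling and convergence checks are omitted):

```python
import numpy as np

def soft_threshold(rho, lam):
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    n_features = X.shape[1]
    w = np.zeros(n_features)
    for _ in range(n_iters):
        for j in range(n_features):
            # residual with coordinate j removed from the current prediction
            r_j = y - X @ w + w[j] * X[:, j]
            rho = X[:, j] @ r_j
            z = X[:, j] @ X[:, j]
            # one-dimensional minimizer along coordinate j (other coordinates fixed)
            w[j] = soft_threshold(rho, lam) / z
    return w
```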

17. The differences between the various optimizers; how is Adadelta an improvement over Adagrad?

https://blog.csdn.net/weixin_40170902/article/details/80092628

18. The impact of learning rate on optimization

19. How does the batch size affect the convergence speed   https://www.zhihu.com/question/32673260

20. The differences and connections among DNN, CNN, and RNN

References:

 https://zhuanlan.zhihu.com/p/97326991

 https://zhuanlan.zhihu.com/p/97311641

https://zhuanlan.zhihu.com/p/97324416

21. Can softmax overflow? Why does it overflow, and how can the overflow be avoided?

https://www.codelast.com/%E5%8E%9F%E5%88%9B-%E5%A6%82%E4%BD%95%E9%98%B2%E6%AD%A2softmax%E5%87%BD%E6%95%B0%E4%B8%8A%E6%BA%A2%E5%87%BAoverflow%E5%92%8C%E4%B8%8B%E6%BA%A2%E5%87%BAunderflow/
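
The standard fix is to subtract the maximum logit before exponentiating, which leaves the result unchanged but keeps exp from overflowing; a minimal NumPy sketch:

```python
import numpy as np

def stable_softmax(z):
    z_shift = z - np.max(z)   # largest exponent becomes exp(0) = 1, so no overflow
    exp_z = np.exp(z_shift)
    return exp_z / exp_z.sum()

print(stable_softmax(np.array([1000.0, 1001.0, 1002.0])))  # naive softmax would overflow here
```
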
22. How to solve the problem of model non-convergence  https://zhuanlan.zhihu.com/p/36369878

23. How to optimize the neural network? (Speed up training and improve accuracy) https://zhuanlan.zhihu.com/p/41286585


Original post: blog.csdn.net/Matrix_cc/article/details/105485488