Deep learning interview "eight-part essay" notes (2023.9.06)

1. Optimizer

1. What is SGD?

  • Batch gradient descent: traverses the entire data set to compute the loss function once per update. It is computationally expensive, slow, and does not support online learning.
  • Stochastic gradient descent (SGD): randomly selects one sample at a time to compute the loss, find its gradient and update the parameters. It is fast, but its convergence behavior may not be very good.
  • Mini-batch SGD: uses a small batch of samples to approximate the whole data set; the samples are divided into m mini-batches, each containing n samples.
  • SGD with momentum: in plain SGD the step taken at each update depends only on the current gradient, while with momentum it also depends on the velocity accumulated from past gradients. Momentum mainly addresses two problems of SGD: the noise introduced by stochastic gradients, and the ill-conditioning of the Hessian matrix, which shows up as SGD oscillating back and forth around the correct descent direction during convergence (a minimal update sketch follows this list).
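
A minimal sketch of the SGD-with-momentum update rule (plain PyTorch; `lr`, `mu` and the parameter/velocity lists are illustrative names, not from the original text):

```python
import torch

def sgd_momentum_step(params, velocities, lr=0.01, mu=0.9):
    """One SGD-with-momentum update: v = mu * v - lr * grad; p = p + v."""
    with torch.no_grad():
        for p, v in zip(params, velocities):
            if p.grad is None:
                continue
            v.mul_(mu).add_(p.grad, alpha=-lr)  # accumulate velocity from past gradients
            p.add_(v)                           # move the parameter along the velocity
```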

2. Briefly introduce the Adam algorithm

Adam is essentially RMSProp with a momentum term. RMSProp scales the learning rate by an exponentially decaying average of squared gradients; Adam additionally incorporates momentum as an exponentially weighted estimate of the first moment of the gradient. Unlike RMSProp, whose second-moment estimate is heavily biased toward zero early in training because it is initialized at the origin and lacks a correction factor, Adam applies bias correction to both the first-moment (momentum) and the (uncentered) second-moment estimates. It then uses these two moment estimates to dynamically adjust the learning rate of each parameter. The main advantage of Adam is that, after bias correction, the effective learning rate at each iteration stays within a bounded range, which keeps the parameter updates relatively stable. The standard update equations are sketched below.
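
For reference, the standard Adam update for gradient $g_t$ at step $t$ (with the paper's defaults $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$):

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= m_t/(1-\beta_1^t), \qquad \hat{v}_t = v_t/(1-\beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t / \left(\sqrt{\hat{v}_t} + \epsilon\right)
\end{aligned}
$$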

3. The difference between Adam and SGD

  • The disadvantage of SGD is that its update direction depends entirely on the gradient computed from the current batch, so it is very unstable.
  • The main advantages of Adam are: ① it takes the gradient information of past steps into account, which reduces the noise in gradient updates; ② after bias correction, the effective learning rate at each iteration stays within a bounded range, which keeps the parameters relatively stable.
  • However, Adam may overfit to features that appear early in training, and features that appear later have difficulty correcting that early fit. Moreover, neither SGD nor Adam can fully avoid the problem of local optima.

2. Overfitting

1. What does overfitting mean? What is the cause? What are the solutions?

  • Definition: A model performs well on the training set, but performs poorly on the test set and new data.
  • Causes: ① the model is too complex and has too many parameters; ② the training data is too small; ③ the distributions of the training set and the test set are inconsistent; ④ the noise in the samples interferes too much, so the model memorizes noise features.
  • Solutions: ① reduce model complexity; ② data augmentation; ③ regularization (L1, L2, dropout); ④ early stopping (see the snippet below).
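
A minimal PyTorch illustration of two of these remedies, L2 regularization via `weight_decay` and dropout (the model layout and hyperparameters are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes 50% of activations during training
    nn.Linear(64, 10),
)

# weight_decay adds an L2 penalty on the weights to each update
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```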

2. Comparison between overfitting and underfitting

Overfitting: the model performs well on the training set but poorly on the test set and on new data; its predictions have high variance.

Underfitting: the model performs poorly on both the training set and the test set; its predictions have high bias, because the model cannot fit even the training samples.

3. The difference between bias and variance

  • Bias: bias measures the gap between the expected prediction of the learning algorithm and the true result; it characterizes the fitting ability of the algorithm itself.
  • Variance: variance measures how the learned model changes across different training sets of the same size; it characterizes the effect of perturbations in the data.

3. Normalization

1. What is batch normalization (BN)? What is its effect?

  • The calculation formula of batch normalization (BN) is as follows:
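
For a mini-batch $\mathcal{B}=\{x_1,\dots,x_m\}$, the standard BN transform (as in the original BN paper) is:

$$
\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^{m} x_i,\qquad
\sigma_\mathcal{B}^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i-\mu_\mathcal{B})^2,\qquad
\hat{x}_i = \frac{x_i-\mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2+\epsilon}},\qquad
y_i = \gamma\,\hat{x}_i + \beta
$$

where $\gamma$ and $\beta$ are learnable scale and shift parameters.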

  • Effect: speeds up training and convergence of the network; controls gradient explosion and prevents gradient vanishing.

2. What is the difference in calculation of mean and variance in BN during training and inference?

  • During training, the mean and variance are the per-dimension mean and variance of the data in the current batch.
  • During inference, the mean and variance are estimates of their expectations over all batches; they are computed as moving (running) averages during training, which avoids having to store the mean and variance of every batch (see the sketch below).
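
A minimal sketch of how the running statistics are typically maintained (PyTorch's BatchNorm keeps the same kind of running buffers, though its exact bookkeeping differs slightly; the values here are illustrative):

```python
import torch

num_features, momentum = 16, 0.1
running_mean = torch.zeros(num_features)
running_var = torch.ones(num_features)

def bn_update_stats(x):
    """Training step: normalize with batch stats and fold them into the running averages."""
    global running_mean, running_var
    batch_mean = x.mean(dim=0)
    batch_var = x.var(dim=0, unbiased=False)
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    running_var = (1 - momentum) * running_var + momentum * batch_var
    return (x - batch_mean) / torch.sqrt(batch_var + 1e-5)

def bn_inference(x):
    """Inference: use the accumulated running statistics instead of batch stats."""
    return (x - running_mean) / torch.sqrt(running_var + 1e-5)

out_train = bn_update_stats(torch.randn(32, num_features))
out_eval = bn_inference(torch.randn(4, num_features))
```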

3. The difference between LN (layer normalization) and BN

Compared with BN, LN only uses the statistics within a single sample, so it does not need the running mean and running variance buffers found in BN implementations, and it is completely insensitive to the input batch_size (see the small comparison below).
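
A small PyTorch comparison of the dimensions being normalized (shapes chosen for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16)           # (batch_size, features)

bn = nn.BatchNorm1d(16)          # normalizes each feature across the 8 samples of the batch
ln = nn.LayerNorm(16)            # normalizes the 16 features within each single sample

print(bn(x).shape, ln(x).shape)  # both keep the shape: torch.Size([8, 16])
```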

4. Why is BN not used in the Transformer?

  • One explanation is that CV and NLP data have different characteristics. For NLP data, the batch statistics and their gradients are unstable during forward and backward propagation, and the tokens at the same position across different sentences in a batch are not necessarily comparable.
  • Reasonable normalization requires an (approximately) i.i.d. assumption along the dimension being normalized. In CV, the images within a batch are independent, but NLP text is essentially a variable-length time series: sequences of different lengths are, in principle, different statistical objects, so stable batch statistics are hard to obtain. Since BN relies on moving averages of these statistics at prediction time, it cannot be applied reliably.

5. The main causes of vanishing and exploding gradients

  • Vanishing gradients: mainly caused by deep networks, and secondarily by an inappropriate loss/activation function. The parameters close to the input layer are updated very slowly, so training is effectively equivalent to learning only the later, shallow layers.
  • Exploding gradients: generally occur in deep networks or when the weight initialization is too large. In deep or recurrent neural networks, error gradients accumulate multiplicatively across updates; if the gradient factors between layers are greater than 1.0, repeated multiplication makes the gradient grow exponentially, which leads to very large weight updates and an unstable network. Exploding gradients can make the weights overflow during training, so the model loss becomes NaN.
  • Solutions: gradient clipping (setting a threshold on the gradient), weight regularization, BN, and the shortcut connections of residual networks (a clipping sketch follows this list).
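
A minimal PyTorch sketch of gradient clipping inside one training step (the tiny model and data are placeholders for illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs, targets = torch.randn(4, 10), torch.randn(4, 1)

loss = nn.functional.mse_loss(model(inputs), targets)
optimizer.zero_grad()
loss.backward()
# clip the global gradient norm to 1.0 before the optimizer step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```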

6. Multiplication in Pytorch

  • Elementwise (scalar) multiplication: torch.mul.
  • Vector dot product: torch.dot computes the dot (inner) product of two tensors; both must be one-dimensional vectors.
  • Matrix multiplication: torch.mm takes two tensors of shapes n×c and c×m and outputs an n×m tensor.
  • torch.matmul behaves like torch.mm on two-dimensional tensors, and it also supports higher-dimensional (batched/broadcast) tensor multiplication.
  • The @ operator performs matrix multiplication, equivalent to torch.matmul (torch.mm for 2-D tensors); the number of columns of the first operand must equal the number of rows of the second (examples below).
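
Small examples of each operation (values chosen for illustration):

```python
import torch

a, b = torch.tensor([1., 2., 3.]), torch.tensor([4., 5., 6.])
m, n = torch.randn(2, 3), torch.randn(3, 4)

print(torch.mul(a, b))           # elementwise: tensor([ 4., 10., 18.])
print(torch.dot(a, b))           # 1-D inner product: tensor(32.)
print(torch.mm(m, n).shape)      # (2, 3) x (3, 4) -> torch.Size([2, 4])
print(torch.matmul(m, n).shape)  # same as mm for 2-D inputs
print((m @ n).shape)             # @ operator, equivalent to matmul
```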

4. Neural Network

1. What is a convolutional network?

The convolution operation is the inner product of the image (a local data window) with a filter matrix.

Here, "image" refers to the data in the different sliding windows; the filter matrix is a set of fixed weights (since each neuron's weights are fixed, it can be viewed as a constant filter); and the inner product is the operation of multiplying element by element and summing.

2. What is the pooling layer of CNN?

Pooling means taking the average or the maximum over a region, i.e. average pooling or max pooling.

For example, with 2×2 max pooling on a 4×4 input: 6 is the maximum of the upper-left 2×2 block, 8 of the upper-right block, 3 of the lower-left block and 4 of the lower-right block, so the pooled output is the 2×2 matrix [[6, 8], [3, 4]] (see the code below).
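
The same example in PyTorch, using a hypothetical 4×4 input constructed so that its 2×2 block maxima are 6, 8, 3, 4:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 6., 8., 2.],
                  [3., 2., 5., 7.],
                  [3., 1., 0., 4.],
                  [2., 0., 1., 3.]]).view(1, 1, 4, 4)  # (N, C, H, W)

print(F.max_pool2d(x, kernel_size=2))  # tensor([[[[6., 8.], [3., 4.]]]])
print(F.avg_pool2d(x, kernel_size=2))  # average pooling over the same windows
```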

3. How to determine the number of convolution kernel channels and the number of channels of the convolution output layer of CNN?

The number of convolution kernel channels of CNN = the number of channels of the convolution input layer

The number of channels in the convolution output layer of CNN = the number of convolution kernels
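
This can be checked directly from the weight and output shapes of a Conv2d layer (numbers are illustrative):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

print(conv.weight.shape)                       # torch.Size([8, 3, 3, 3]): 8 kernels, each with 3 channels
print(conv(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 8, 32, 32]): 8 output channels
```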

4. Briefly describe what is a generative adversarial network (GAN)?

Suppose there are two models: a generative model (hereafter G) and a discriminative model (hereafter D). The task of D is to judge whether an instance is real or produced by the generator; the task of G is to generate instances that fool D. The two models are trained against each other, and ideally, at the end, the instances generated by G are indistinguishable from real ones and D can no longer tell natural data from generated data. A compact training-step sketch follows.
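
A compact sketch of one adversarial training step (the networks, dimensions and "real" data here are stand-ins, not a reference implementation):

```python
import torch
import torch.nn as nn

latent_dim = 16
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 32))  # generator
D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))           # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(8, 32)              # stand-in for a batch of real samples
z = torch.randn(8, latent_dim)

# 1) update D: real samples should get label 1, generated samples label 0
fake = G(z).detach()
loss_d = bce(D(real), torch.ones(8, 1)) + bce(D(fake), torch.zeros(8, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# 2) update G: try to make D label the generated samples as real
loss_g = bce(D(G(z)), torch.ones(8, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```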

5. Briefly introduce the calculation graph of tensorflow?

TensorFlow's computation graph is also called a data flow graph. A data flow graph describes mathematical computation as a directed graph of nodes and edges. Nodes generally represent mathematical operations, but they can also represent endpoints where data is fed in or pushed out, or where persistent variables are read or written. "Edges" represent the input/output relationships between "nodes" and carry dynamically-sized multi-dimensional data arrays, i.e. "tensors". The image of tensors flowing through the graph is why the tool is named "TensorFlow". Once all the tensors on a node's input side are ready, the node is assigned to a computing device and executed asynchronously and in parallel.

6. What experience do you have in deep learning (RNN, CNN) parameter adjustment?

CNN parameter tuning mainly concerns the choice of optimizer, the embedding dimension, and the number of layers of the residual network.

  • For the optimizer there are two main choices, SGD and Adam. Relatively speaking, Adam is much simpler, needs little hyperparameter tuning, and works reasonably well.
  • The embedding dimension has an optimum: the effect improves as the dimension grows at first, and after a certain point it gets worse as the dimension keeps increasing.
  • The number of residual layers interacts with the embedding dimension; as the number of layers increases, the effect also rises and then falls (a single optimum).
  • Other choices include the activation function, dropout layers and BN layers. ReLU is recommended as the activation function. The dropout rate should not be set too high, otherwise the network fails to converge; adjusting it in steps of 0.05, an optimal value can usually be found at around 0.5.

7. Why can CNN be used in different machine learning fields? What common problems does CNN solve in these fields? How does it solve them?

The key to CNN is the convolution operation. The local connection between the convolution kernel and the input layer extracts local feature information of the whole input, or combinations of the input features. So the essence of CNN is to perform feature extraction or feature combination on the raw features, thereby increasing the expressive power of the model. Machine learning in different fields models the characteristics of the data to solve the problems of that field; CNN therefore solves the common problem of feature extraction across fields, and the means it uses are local connections, weight sharing, pooling operations and a multi-level structure.

8. Why is the LSTM structure better than RNN?

LSTM adds a forget gate, an input gate, a cell state and a separate hidden state. Because the cell state is updated additively, with new information gated by the input gate and old information gated by the forget gate, instead of the purely multiplicative recurrence used in vanilla RNNs, LSTM can mitigate vanishing and exploding gradients. The standard gate equations are given below.
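
One common formulation of the LSTM equations ($\sigma$ is the sigmoid function, $\odot$ is the elementwise product):

$$
\begin{aligned}
f_t &= \sigma(W_f[h_{t-1}, x_t] + b_f) &\text{(forget gate)}\\
i_t &= \sigma(W_i[h_{t-1}, x_t] + b_i) &\text{(input gate)}\\
\tilde{c}_t &= \tanh(W_c[h_{t-1}, x_t] + b_c) &\text{(candidate state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t &\text{(additive cell update)}\\
o_t &= \sigma(W_o[h_{t-1}, x_t] + b_o) &\text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$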

9. What are the advantages and disadvantages of the three activation functions Sigmoid, Tanh, and Relu?

  • Sigmoid: ① Advantages: the output range is (0, 1), which suits binary classification since the output can be interpreted as a probability; it has a smooth derivative and can be trained with gradient-based optimizers. ② Disadvantages: it saturates, so when the input is very large or very small the gradient is close to zero, gradients vanish and training slows down; its output is not zero-centered, which can make training somewhat unstable; and the exponential is relatively expensive to compute.
  • Tanh: ① Advantages: the output range is (-1, 1) and, compared with sigmoid, the mean is closer to zero, which helps alleviate vanishing gradients; it also has a smooth derivative and works with gradient-based optimizers. ② Disadvantages: it still saturates, so gradients vanish for large or small inputs; the exponential is also expensive to compute.
  • ReLU (Rectified Linear Unit): ① Advantages: the computation is trivial (just compare the input with zero), so training is fast; unlike the previous two, ReLU does not saturate on the positive side, which avoids the vanishing gradient problem there. ② Disadvantages: some neurons may "die" (their output stays zero and they are never updated), making the network sparse; the gradient is zero for negative inputs, and the output is unbounded on the positive side, which can contribute to exploding activations and gradients.
  • To overcome these shortcomings, researchers have developed various improved activation functions such as Leaky ReLU, Parametric ReLU (PReLU), ELU and Swish, which to some extent address saturation, gradient explosion and dead neurons (reference definitions are sketched below).
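
Simple reference definitions of the three basic activations plus Leaky ReLU (plain PyTorch tensor ops, for illustration only):

```python
import torch

def sigmoid(x): return 1 / (1 + torch.exp(-x))                  # output in (0, 1), saturates for large |x|
def tanh(x):    return torch.tanh(x)                             # output in (-1, 1), zero-centered
def relu(x):    return torch.clamp(x, min=0)                     # 0 for x < 0, identity for x >= 0
def leaky_relu(x, a=0.01): return torch.where(x > 0, x, a * x)   # small slope on the negative side

x = torch.linspace(-3, 3, 7)
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x), sep="\n")
```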

10. Why introduce nonlinear excitation function?

The premise of deep learning is that a nonlinear activation function is applied in the hidden layers of the neural network, which gives the model nonlinear expressive power and lets the network approximate arbitrarily complex functions. Suppose there is a 100-layer fully connected network whose hidden-layer activations are all linear; then the whole network from input to output is equivalent to a single fully connected layer, and no real depth is gained. Example: applying the linear function f(x) = 2x + 3 three times is still just one linear transformation of x: f(f(f(x))) = 2(2(2x + 3) + 3) + 3 = 8x + 21.

11. Why does the LSTM use both sigmoid and tanh activation functions, instead of just one of them? What is the purpose of this?

  • The sigmoid function is used on the gates because its output lies between 0 and 1, acting as a soft switch: a value near 1 means remember and a value near 0 means forget; sigmoid is the most direct choice for this.
  • Tanh is used on the candidate cell state and the output to transform the data into (-1, 1); in principle other activation functions could be used here.

12. How to solve the problems of exploding and vanishing gradients in RNNs?

To address the gradient explosion problem, Tomas Mikolov first proposed a simple heuristic: truncate (clip) the gradient to a smaller value whenever it exceeds a certain threshold.

13. What kind of data sets are not suitable for deep learning?

  • When the data set is too small and there are not enough samples, deep learning has no clear advantage over other machine learning methods.
  • When the data set has no local correlation structure. The areas where deep learning currently does well, mainly image, speech and natural language processing, share one property: local correlation. Pixels form objects in images, phonemes combine into words in speech, and words combine into sentences in text; once the arrangement of these feature elements is disrupted, the meaning they represent changes. Data sets without such local correlations are not well suited to deep learning. For example, to predict a person's health status, the relevant inputs include age, occupation, income, family status, and so on; reordering these features does not change the result.

14. How is the generalized linear model used in deep learning?

From a statistical perspective, deep learning can be viewed as a recursive generalized linear model. Compared with the classic linear model, the core of the generalized linear model is the introduction of a link function g(·), so the model becomes y = g⁻¹(wx + b). In the recursive generalized linear model view of deep learning, the activation function of a neuron plays the role of the link function; the logistic function of logistic regression (one kind of generalized linear model) is exactly the sigmoid activation function. Many similar methods simply carry different names in statistics and in neural networks, which easily causes confusion.

15. A brief history of the development of neural networks

  • Sigmoid saturates and makes gradients vanish, so ReLU appeared.
  • The negative half-axis of ReLU is a dead zone where the gradient becomes 0, so LeakyReLU and PReLU appeared.
  • Emphasis on the stability of gradient and weight distributions gave rise to ELU and, more recently, SELU.
  • Networks became too deep for gradients to propagate all the way down, so Highway Networks were created.
  • Then the highway gating parameters were dropped in favor of plain residual connections, and ResNet was born.

16. What is the true meaning of activation function in neural network?

  • Non-linearity: that is, the derivative is not a constant. This condition is the basis of multi-layer neural networks and ensures that multi-layer networks do not degenerate into single-layer linear networks. This is also the meaning of the activation function.
  • Differentiable almost everywhere: Differentiability ensures the computability of gradients in optimization. Traditional activation functions such as sigmoid are differentiable everywhere. For piecewise linear functions such as ReLU, it is only differentiable almost everywhere (that is, it is not differentiable only at a limited number of points). For the SGD algorithm, since it is almost impossible to converge to a position where the gradient is close to zero, the limited non-differentiable points will not have a great impact on the optimization results.
  • The computation is simple: there are many nonlinear functions; at the extreme, a multi-layer neural network can itself serve as a nonlinear function, similar to the treatment of the convolution operation in Network In Network [2]. However, the number of activation-function evaluations in a forward pass is proportional to the number of neurons, so a simple nonlinear function is naturally better suited as an activation function. This is one reason why ReLU-like functions are more popular than activations involving operations such as exp.

17. Deep neural networks easily converge to local optima. Why are they widely used?

The claim that deep neural networks "easily converge to a local optimum" is probably an illusion; in practice we may never even find a true local optimum, let alone the global optimum. The widespread view that local optima are the main difficulty in neural network optimization comes from intuition about one-dimensional problems, where the most visible difficulty is indeed the many local extrema.

18. How to determine whether gradient explosion occurs?

  • The model cannot learn from the training data (for example, the loss stays poor or does not decrease)
  • The model is unstable, and the loss changes dramatically between updates
  • During training, the model loss becomes NaN

19. What is the difference between GRU and LSTM? 

GRU stands for Gated Recurrent Units, which is a type of recurrent neural network.

  • GRU has only two gates (update and reset), and GRU directly passes the hidden state to the next unit.
  • LSTM has three gates (forget, input, output) and maintains a memory cell in addition to the hidden state (see the comparison below).
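
A quick PyTorch comparison of the two modules (sizes are illustrative); LSTM returns both a hidden state and a cell state, while GRU returns only a hidden state:

```python
import torch
import torch.nn as nn

x = torch.randn(5, 8, 32)                    # (seq_len, batch, input_size)
gru = nn.GRU(input_size=32, hidden_size=64)
lstm = nn.LSTM(input_size=32, hidden_size=64)

out_g, h_g = gru(x)                          # GRU: output + hidden state
out_l, (h_l, c_l) = lstm(x)                  # LSTM: output + (hidden state, cell state)
print(out_g.shape, h_g.shape)                # torch.Size([5, 8, 64]) torch.Size([1, 8, 64])
print(out_l.shape, h_l.shape, c_l.shape)
```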
