Deep learning basics for the 2023 autumn recruitment: notes for computer-science master's candidates from "double non" (non-985/non-211) universities applying for algorithm positions

Link to obtain the Word version of this material:
Link: https://pan.baidu.com/s/1H5ZMcUq-V7fxFxb5ObiktQ
Extraction code: kadm

The convolutional layer
A fully connected neural network needs large amounts of computation for forward and backward propagation, and it stores a huge number of parameters. If the number of training samples is not of a comparable magnitude, the network can easily memorize every sample it is given, which leads to overfitting.
Local perception: In a traditional (fully connected) neural network, each neuron is connected to every pixel of the image, which produces a huge number of weights and makes the network difficult to train. When the human brain recognizes a picture, it does not take in the whole image at once; it first perceives local features and then combines them at a higher level to obtain global information.
Weight sharing: The weights of a convolution kernel are learned, and they do not change as the kernel slides over the image during convolution. This is the idea of parameter sharing: a single convolution kernel extracts the same (or similar) feature at different positions of the input image. Simply put, the same target has essentially the same characteristics wherever it appears in a picture.
Receptive field: the region of the input image that a pixel on a layer's output feature map corresponds to. It can be computed recursively, working back from the output layer: RF = (out − 1) × stride + ksize, where out is the receptive field size computed for the previous (deeper) layer, stride is the current layer's stride, and ksize is the current layer's kernel size.
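A minimal sketch of this backward recursion in Python (the layer list in the example is illustrative):

```python
def receptive_field(layers):
    """Receptive field on the input image for one pixel of the final feature map.

    layers: list of (ksize, stride) tuples, ordered from input to output.
    Works backwards through the layers: RF = (RF_deeper - 1) * stride + ksize.
    """
    rf = 1
    for ksize, stride in reversed(layers):
        rf = (rf - 1) * stride + ksize
    return rf

# Two 3x3 convolutions with stride 1 give a receptive field of 5,
# matching the "two 3x3 kernels replace one 5x5" rule discussed later.
print(receptive_field([(3, 1), (3, 1)]))  # 5
```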
Pooling layer
Pooling: also called undersampling or downsampling. It is mainly used for feature dimensionality reduction, compressing the amount of data and the number of parameters, reducing overfitting, maintaining certain invariances (rotation, translation, scaling, etc.), and improving the fault tolerance of the model. The main types are Max Pooling (maximum pooling) and Average Pooling (average pooling).
max retains texture features and extracts the value with the largest feature difference, which will discard a lot of information.
avg retains the overall data characteristics, and the retained information is very complete, but the feature differences may be reduced due to averaging.
Average pooling is used when every part of the feature contributes useful information. Deeper in the network, the spatial size (H×W) of the feature map is small and each position carries more semantic information, so max pooling is less appropriate there.
Activation layer
The activation function is the source of nonlinearity in a neural network: if these functions were removed, the whole network would contain only linear operations, and the composition of linear operations is still linear, so the final model would be equivalent to a single-layer linear model.

Sigmoid:
1. If the network weights are initialized to random values in [0, 1], the derivation of backpropagation shows that the gradient contributed by each sigmoid layer cannot exceed 0.25. When there are many hidden layers, the gradient becomes very small and close to 0 after passing through them, i.e. the gradient vanishes. Conversely, if the weights are initialized to values in (1, +∞), the gradient will explode.
2. The output of Sigmoid is not zero-mean (ie, zero-centered). This will cause the neurons in the subsequent layer to receive the non-zero mean signal output from the previous layer as input.
3. Its analytic expression contains exponentiation, which is relatively expensive for a computer to evaluate.

Tanh:
It solves the problem of non-zero-centered output of the Sigmoid function. However, the problem of gradient vanishing and the problem of power operation still exist.

Relu:
ReLU is a piecewise linear function with relatively weak nonlinearity, yet deep networks can still be built with it, since its non-saturating positive region keeps gradients from vanishing.
The calculation speed is very fast. You only need to determine whether the input is greater than 0.
The output of ReLU is not zero-centered.


Output layer (fully connected layer): every neuron in one layer has a weighted connection to every neuron in the adjacent layer. The fully connected layer usually sits at the end of a convolutional neural network, and the final output is obtained through the softmax function; the neurons are connected in the same way as in a traditional neural network.

Softmax: it maps the outputs of multiple neurons into the (0, 1) interval so that they can be interpreted as probabilities, which enables multi-class classification.

The batch normalization layer
It addresses the change of the data distribution in intermediate layers during training, which helps prevent gradient vanishing or explosion and speeds up training.
If the distributions of the training data and the test data differ, the generalization ability of the network drops sharply.
Likewise, if every batch of training data has a different distribution (as in mini-batch gradient descent), the network has to adapt to a new distribution at every iteration, which greatly slows training. This is also why the input data needs to be normalized as a preprocessing step.

Atrous (dilated) convolution: by inserting zeros (holes) between the elements of a standard convolution kernel, a 3×3 kernel can have the same receptive field as a 5×5 kernel.
Using dilated convolutions with different dilation rates yields kernels with different receptive fields, which can handle objects of different sizes and also capture the texture and structural information of an object.
Spatial self-attention: atrous convolutions with different receptive fields inevitably produce redundant information and noise. In this module, the multi-scale features obtained by atrous convolution are fused with an Add operation rather than a Concatenate operation. During multi-scale feature fusion, if weighted fusion is not used, the redundant information and noise receive the same response values as the useful features.

1.DNN (Deep Neural Network)
1.1 Neural Network
The learning of a neural network is learning how to use the linear transformations of matrices and the nonlinear transformations of activation functions to project the original input space into a linearly separable / sparse space for classification or regression. Increasing the number of nodes increases the dimensionality, i.e. the linear transformation capacity; increasing the number of layers increases the number of activation functions, i.e. the number of nonlinear transformations.
1.2 Neurons
Neural networks are composed of a large number of neurons connected to each other. After each neuron receives the input of a linear combination, it is initially simply linearly weighted. Later, a nonlinear activation function is added to each neuron to perform a nonlinear transformation and output. The connection between each two neurons represents a weighted value, called weight. Different weights and activation functions will lead to different outputs of the neural network.
1.3 Activation Function
Reference
Commonly used nonlinear activation functions include sigmoid, tanh, relu, etc. The first two, sigmoid/tanh, are more common in fully connected layers, and the latter, relu, are common in convolutional layers.

1.3.1.sigmoid

The sigmoid function g(z) = 1/(1 + e^(-z)) is plotted below (the horizontal axis is the input z, the vertical axis the output g(z)):

• In other words, the sigmoid function compresses a real number into the interval (0, 1): when z is a very large positive number, g(z) approaches 1, and when z is a very negative number, g(z) approaches 0.
Shortcomings:

  1. It is easy to cause the gradient to disappear.
    Q: What are vanishing and exploding gradients, and how can they be mitigated?
    To solve gradient explosion:
    a. Gradient clipping (truncation), or adding regularization terms.
    To solve gradient vanishing:
    a. Change the RNN structure: use self-loops and gating mechanisms such as LSTM.
    b. Optimize the activation function, e.g. replace sigmoid with ReLU.
    c. Use batch normalization.
    d. Use residual structures.
  2. The output of Sigmoid is not zero-centered.
    This will cause the neurons in the subsequent layer to receive the non-zero mean signal output from the previous layer as input.
    Result: the local gradients of w all have the same sign, so during backpropagation the components of w are all updated in the positive direction or all in the negative direction, producing a zig-zag (bundling) effect and slowing convergence.
  3. The analytical expression contains exponentiation operations.
    Computer solutions are relatively time-consuming. For larger deep networks, this will significantly increase the training time.
1.3.2 tanh

Features: similar to sigmoid, but the output range is (-1, 1), which solves the non-zero-centered output problem of the sigmoid function.
Problems of vanishing gradients and exponentiation still exist.
1.3.3.ReLU

Advantages:

  1. Solved the vanishing gradient problem
  2. The calculation speed is very fast, you only need to determine whether the input is greater than 0
  3. The convergence speed is much faster than sigmoid and tanh, because sigmoid's maximum gradient is only 0.25 (and tanh's gradient is below 1 almost everywhere), while ReLU's gradient is 1 in the positive region.
    Disadvantages:
  4. Output is not zero-centered
  5. Dead ReLU Problem means that some neurons may never be activated, causing the corresponding parameters to never be updated. There are two main reasons why this situation may occur: (1) Very unfortunate parameter initialization, which is rare. For example, w is initialized to all negative numbers. (2) The learning rate is too high, resulting in too large parameter updates during the training process, which unfortunately puts the network into this state. The solution is to use the Xavier initialization method and avoid setting the learning rate too high or using algorithms such as adagrad to automatically adjust the learning rate. (See 1.7 of this article for weight initialization)
  6. ReLU is not differentiable at the origin.
1.3.4 Leaky ReLU

Advantages:
There will be no Dead ReLU problem.
The mean value of the output is close to 0, zero-centered.
Disadvantages:
The calculation amount is slightly larger.
It is not differentiable at the origin.
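For reference, a small NumPy sketch of the four activations discussed above (the 0.01 negative slope for Leaky ReLU is a common but arbitrary choice):

```python
import numpy as np

def sigmoid(z):
    # squashes inputs into (0, 1); maximum derivative is 0.25 at z = 0
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # zero-centered output in (-1, 1); still saturates for large |z|
    return np.tanh(z)

def relu(z):
    # zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # small slope alpha for negative inputs avoids "dead" units
    return np.where(z > 0, z, alpha * z)
```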
1.3.5 Why does a neural network introduce nonlinearity?
Answer: Without an activation function, the input of each layer of nodes is a linear function of the previous layer's output. No matter how many layers the network has, the output is then just a linear combination of the input, which is equivalent to having no hidden layers, so the network's learning ability is limited.
The main characteristics of deep learning are being multi-layer and nonlinear. Multiple layers allow the network to learn more; without nonlinearity there is no difference between many layers and a single layer, everything reduces to a simple linear combination, which cannot even solve XOR.
1.4 neural network

  1. Input layer (Input layer), many neurons (Neurons) receive a large number of non-linear input messages. The input message is called the input vector.
  2. Output layer (Output layer), information is transmitted, analyzed, and weighed in neuron links to form output results. The output message is called the output vector.
  3. Hidden layer, referred to as "hidden layer", is a layer composed of many neurons and links between the input layer and the output layer. If there are multiple hidden layers, it means multiple activation functions.
1.4 Backpropagation (be able to derive it)
Backpropagation is the method used to compute the derivatives of the loss function L with respect to the parameters w, propagating the derivatives layer by layer through the chain rule. A key point is to initialize the parameters randomly rather than all to 0; otherwise every hidden unit in a layer computes the same value and receives the same update, a problem known as symmetry failure.
1.5 Gradient vanishing and gradient explosion
Gradient vanishing: deep networks traditionally use the sigmoid activation, whose derivative approaches 0 when the input is very large or very small. Backpropagation multiplies the layers' partial derivatives together, so when the network is very deep, the error from the last layer is multiplied by many numbers smaller than 1, becomes smaller and smaller, and finally goes to 0, so the weights of the shallow layers are not updated.
Gradient explosion: by the same reasoning, it occurs when the activation function is in its active (unsaturated) region and the weights W are too large. Gradient explosion is less common than gradient vanishing.
1.5.1 Methods to solve vanishing or exploding gradients
  1. Change the activation function
  2. Batch Normalization
  3. Gradient clipping
  4. Use a long short-term memory network (LSTM)
  5. Residual structures
  6. Layer-by-layer greedy pre-training
1.6 Overfitting
1.6.0 Causes of overfitting
  1. The data distributions of the training set and the test set are inconsistent.
  2. The data set contains a lot of noise and interference; the model memorizes the noise instead of fitting the relationship between the inputs and the labels.
  3. The model is trained for too many iterations and fits unrepresentative and noisy samples.
  4. The amount of sample data is far smaller than the model's complexity.
1.6.1 Solutions:
  1. L1/L2 regularization (principle: Occam's razor). L1 regularization adds the sum of the absolute values of all weights w to the loss function, forcing more weights to become 0 and making the features sparse. L2 regularization, also called weight decay, adds the sum of the squares of all weights to the objective function. For linear regression, the model with L1 regularization is called Lasso regression and the model with L2 regularization is called Ridge regression.
  2. Early stopping.
  3. Dropout: during training, each neuron is kept active with probability p (i.e. set to 0 with probability 1 − p), similar in spirit to the bagging algorithm.
  4. Batch Normalization (normalize the input of each layer in the network so that it has mean 0 and variance 1; this feature normalization speeds up model training).
  5. Shortcut connections (residual networks, DenseNet).
  6. Data augmentation (increasing the number of samples).
1.6.2 Understanding of L1/L2 regularization
During fitting we usually prefer the weights to be as small as possible, finally constructing a model in which all parameters are small. The usual reasoning is that models with small parameter values are simpler, adapt to different data sets, and to some extent avoid overfitting. For a linear regression equation, if the parameters are large, a small shift in the data has a large effect on the result; if the parameters are small enough, even a larger shift has little effect. The more technical term for this is "strong robustness to perturbations".
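A minimal NumPy sketch of adding the two penalties to a loss (the weights, loss value and lambda values below are illustrative assumptions):

```python
import numpy as np

def regularized_loss(data_loss, w, l1_lambda=0.0, l2_lambda=0.0):
    """data_loss: the unregularized loss already computed for a batch.

    L1 adds lambda * sum(|w|): pushes weights to exactly 0 (sparse features).
    L2 adds lambda * sum(w^2): shrinks weights smoothly (weight decay).
    """
    l1_penalty = l1_lambda * np.sum(np.abs(w))
    l2_penalty = l2_lambda * np.sum(w ** 2)
    return data_loss + l1_penalty + l2_penalty

w = np.array([0.5, -1.2, 3.0])
print(regularized_loss(0.8, w, l1_lambda=0.01, l2_lambda=0.01))
```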
1.6.3 Understanding of Batch Normalization
Reference
There is a very important assumption in machine learning: the IID (independent and identically distributed) assumption, i.e. the training data and the test data follow the same distribution. This is the basic guarantee that a model trained on the training data can perform well on the test set. So what does BatchNorm do? During the training of a deep neural network, BatchNorm keeps the input of each layer following the same distribution.
The basic idea of BN is quite intuitive:
As the network gets deeper, or as training proceeds, the distribution of the activation input values before the nonlinear transformation (i.e. x = WU + B, where U is the layer input) gradually shifts. Training converges slowly because this distribution gradually approaches the upper and lower saturated ends of the nonlinear function's range (for sigmoid, this means WU + B takes large negative or positive values), which makes the gradients of the lower layers vanish during backpropagation. This is the essential reason why the convergence of deep networks becomes slower and slower. BN uses a normalization step to force the distribution of each neuron's input back to a standard normal distribution with mean 0 and variance 1, i.e. it pulls the increasingly skewed distribution back to a standard one, so that the activation inputs fall in the region where the nonlinear function is sensitive to its input. Small changes of the input then produce larger changes of the loss, the gradients become larger, the vanishing-gradient problem is avoided, and the larger gradients mean faster convergence, which greatly speeds up training.
Benefits of BatchNorm:
Why is BatchNorm so popular? The key is that it works well. ① It greatly improves training speed and accelerates convergence; ② it can also improve classification performance: one explanation is that it acts as a form of regularization similar to Dropout, preventing overfitting, so good results can be achieved even without Dropout; ③ in addition, hyperparameter tuning becomes much simpler, the initialization requirements are less strict, and a large learning rate can be used.
1.6.4 Implementation of Batch Normalization
Reference

As shown in the figure above, BN consists of 4 main steps:
1. Compute the mean of the training data in each batch.
2. Compute the variance of the training data in each batch.
3. Normalize the batch's training data with the obtained mean and variance to get a zero-mean, unit-variance distribution; ε is a tiny positive number used to avoid division by zero.
4. Scale and shift: multiply x_i by γ to adjust the magnitude and add β to shift it, obtaining y_i, where γ is the scale factor and β is the shift factor. This step is the essence of BN: since the normalized x_i is basically restricted to a standard normal distribution, the expressive power of the network would be reduced, so two new parameters γ and β are introduced, and they are learned by the network itself during training.
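A minimal NumPy sketch of these four training-time steps (the batch shape, epsilon, and initial γ/β values are illustrative):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch_size, features). gamma, beta: learnable (features,) vectors."""
    mu = x.mean(axis=0)                    # step 1: per-feature batch mean
    var = x.var(axis=0)                    # step 2: per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 3: normalize to mean 0, variance 1
    y = gamma * x_hat + beta               # step 4: scale and shift
    return y, mu, var

x = np.random.randn(32, 4) * 3 + 2
gamma, beta = np.ones(4), np.zeros(4)
y, mu, var = batchnorm_forward(x, gamma, beta)
```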
1.6.5 How are the mean and variance obtained when predicting on the test set?
During training we compute the mean and variance of each batch of data and use them for normalization. But how do we obtain the mean and variance at prediction time, for example when predicting a single sample? The mean and variance used in the prediction stage actually come from the training set: while training the model we record the mean and variance of each batch, and after training we take their expectation over the whole training set and use that as the BN mean and variance for prediction.
1.6.6 BN moving-average operation
Reference
The moving average (exponentially weighted average) estimates the local mean of a variable, so that the variable's update depends on its historical values over a period of time.

Advantage of the moving average:
It uses little memory, since the mean can be estimated without storing the last 10 or 100 historical values of θ. (Of course, it is not as accurate as keeping all historical values and averaging them, but the latter needs more memory and is more expensive to compute.)
Training phase:
Use the mean and variance of the current batch for BN, and at the same time keep updating the stored global mean and variance with the moving average.
Testing phase:
In the prediction phase, directly use the mean and variance stored in the model.
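A minimal sketch of this moving-average update for the stored BN statistics (the momentum value 0.9 is an illustrative choice):

```python
def update_running_stats(running_mu, running_var, batch_mu, batch_var, momentum=0.9):
    # exponentially weighted average of the per-batch statistics
    running_mu = momentum * running_mu + (1.0 - momentum) * batch_mu
    running_var = momentum * running_var + (1.0 - momentum) * batch_var
    return running_mu, running_var

# At test time, normalize with running_mu / running_var instead of batch statistics.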
1.7 Several methods of weight initialization

See here
1.7.1 Initialize w to 0.
When we do linear regression or logistic regression we usually initialize the parameters to 0 and the model still works well. In a neural network, however, w cannot be initialized to 0: if it is, every neuron in a layer learns the same thing (their outputs are identical), and during backpropagation they stay identical because their gradients are the same.
1.7.2 Random initialization of w
However, random initialization also has shortcomings. np.random.randn() samples from a Gaussian distribution with mean 0 and variance 1. As the number of layers increases, the outputs of the activation functions (using tanh) of the later layers get closer and closer to 0. Activation outputs close to 0 mean gradients close to 0, which leads to vanishing gradients.
1.7.3 Xavier initialization
Xavier initialization addresses the problem above, where the outputs of the later layers' activation functions tend to 0: with Xavier initialization, the outputs of the deep layers' activations still follow the standard Gaussian distribution nicely. However, although Xavier initialization works well with the tanh activation function, it is still inadequate for ReLU, the activation most commonly used in today's neural networks.
1.7.4 He initialization
To solve this problem, He Kaiming (there are some amusing anecdotes about him if you are interested) proposed an initialization method for ReLU, generally called He initialization. In current neural networks, ReLU is usually used in the hidden layers, and He initialization is the common choice for weight initialization.
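A small NumPy sketch of the three schemes discussed above for a single fully connected layer (the layer sizes are illustrative):

```python
import numpy as np

fan_in, fan_out = 256, 128  # illustrative layer sizes

# naive: plain standard Gaussian -- deep tanh activations shrink towards 0
w_naive = np.random.randn(fan_in, fan_out)

# Xavier/Glorot: variance scaled by fan_in + fan_out, suited to tanh/sigmoid
w_xavier = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / (fan_in + fan_out))

# He: variance scaled by fan_in only, compensating for ReLU zeroing half its inputs
w_he = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
```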
1.8 About softmax
reference
The softmax function is also called the normalized exponential function. It is the generalization of the binary classification function sigmoid in multi-classification, and its purpose is to display the results of multi-classification in the form of probability.
We know that probability has two properties: 1) the predicted probability is non-negative; 2) the sum of the probabilities of various predicted results is equal to 1.
Softmax converts the prediction results from negative infinity to positive infinity into probabilities according to these two steps.

Two steps:

  1. Convert the prediction results to non-negative numbers (by exponentiating them).
  2. Normalize them so that the probabilities of all predicted outcomes sum to 1.
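A numerically stable NumPy sketch of these two steps (subtracting the maximum is a standard trick not mentioned in the text):

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability (does not change the result)
    e = np.exp(z - np.max(z))      # step 1: make every score non-negative
    return e / e.sum()             # step 2: normalize so the outputs sum to 1

print(softmax(np.array([2.0, 1.0, -1.0])))  # approx. [0.705, 0.259, 0.035]
```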
2. CNN (Convolutional Neural Networks)
Convolutional Neural Networks (CNN) are a class of feedforward neural networks that contain convolution operations and have a deep structure; they are one of the representative algorithms of deep learning.
Super detailed introduction

2.1 What is convolution?
The inner product (element-by-element multiplication and summation) operation of the image and the filter matrix is ​​the so-called "convolution" operation, which is also the source of the name of the convolutional neural network.
A few concepts:

  1. Depth: the number of filters (windows), which determines the depth (number of channels) of the output.
  2. Stride: the step size by which the data window moves each time.
  3. Zero-padding: pad the outer border with a few rings of zeros so that the window can slide from the initial position to the end position in whole steps; in plain terms, it makes the total length divisible by the stride.
2.1.1 Why odd-sized convolution kernels are used
There are two main reasons:
• (1) Odd-sized kernels make padding easier. Assume the kernel size is k×k; for the convolved image to keep the same size as the original, the formula gives padding = (k − 1)/2. Careful readers will have spotted the clue: the padding is an integer only when k is odd, otherwise the image cannot be padded evenly.
• (2) It is easier to locate the anchor point. In CNNs the window is slid with respect to a reference point of the kernel, usually its centre; if k were even, the kernel would have no centre point.
2.1.2 The role of the 1x1 convolution kernel
Reference
(1) Cross-channel interaction and information integration: a 1x1 convolution is essentially a linear combination of the information across different channels.
(2) Adding nonlinearity: following a 1x1 convolution with a nonlinear activation function greatly increases the nonlinearity while keeping the spatial size of the feature map unchanged, allowing the network to go very deep.
(3) Reducing the number of model parameters and the amount of computation.
2.1.3 Why two 3x3 kernels can replace one 5x5 kernel
In a convolutional neural network, generally speaking, the larger the convolution kernel, the larger the receptive field, the more image information is seen, and the better the global features obtained. However, a large kernel causes a sharp increase in computation, which hinders increasing the model depth and hurts computational performance.
So in VGG and the Inception networks, a stack of two 3×3 convolution kernels replaces a single 5×5 kernel. The benefits are:
(1) With the same receptive field, the depth of the network increases, which improves the network's effectiveness to some extent;
(2) The number of parameters is reduced (from 5×5×1×channels to 3×3×2×channels).
So, assuming the input is 28x28:
Convolving with a 5x5 kernel, stride 1, padding 0 gives: (28 − 5 + 0×2)/1 + 1 = 24.
Using two layers of 3x3 kernels, also with stride 1 and padding 0:
first 3x3 layer: (28 − 3 + 0×2)/1 + 1 = 26;
second 3x3 layer: (26 − 3 + 0×2)/1 + 1 = 24.
So two 3x3 layers and one 5x5 kernel produce feature maps of the same size.
In the same way, three 3x3 kernels can replace one 7x7 kernel.
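A tiny sketch that reproduces this arithmetic and compares parameter counts (the channel count of 64 is an illustrative assumption):

```python
def conv_out_size(size, ksize, stride=1, padding=0):
    # feature map size = (input + 2*padding - ksize) / stride + 1
    return (size + 2 * padding - ksize) // stride + 1

print(conv_out_size(28, 5))                      # 24
print(conv_out_size(conv_out_size(28, 3), 3))    # 26 -> 24

channels = 64  # illustrative
print(5 * 5 * channels)      # one 5x5 kernel per channel: 1600 weights
print(2 * 3 * 3 * channels)  # two stacked 3x3 kernels per channel: 1152 weights
```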
2.2 Excitation layer
The activation function sigmoid was introduced earlier, but in practice sigmoid saturates easily during gradient descent, which stops gradient propagation, and its output is not zero-centered, so ReLU is used as the excitation (activation) function in CNNs.
The advantages of ReLU are fast convergence and simple gradient computation.
2.3 Pooling layer
Another important idea in CNNs is pooling. A pooling layer usually follows a convolutional layer; it is introduced to simplify the output of the convolutional layer. Intuitively, the pooling layer also slides a window over the convolutional layer's output, but this window is much simpler than a convolution window: it has no parameters such as w and b, and only performs simple operations on the neurons inside the window, such as summing or taking the maximum, using the resulting value as the input of the pooling-layer neuron.
A convolutional layer usually has multiple windows (filters), and so does the pooling layer. Simply put, the convolutional layer slides windows over the input to perform convolutions, and the pooling layer slides windows over the convolutional output to perform pooling.
One of the biggest benefits of pooling to remember: after pooling, the number of feature values is greatly reduced, which also greatly reduces the number of parameters of the later layers.
Although convolution already reduces the output size (number of features) considerably, it is still expensive to compute and prone to overfitting, so the static (stationarity) property of images is exploited to reduce the size further through pooling.
2.4 Fully connected layer
The pooling layer is fully connected to the output layer, the same as in a DNN.
If the convolutional layers, pooling layers and activation layers map the original data into a hidden feature space, the fully connected layer maps the learned "distributed feature representation" to the sample label space.
2.5 What is weight sharing?
Local perception: the network is locally connected, i.e. each neuron is connected only to some neurons of the previous layer and only perceives a local region rather than the whole image (implemented with a sliding window). Nearby pixels are strongly correlated while distant pixels are weakly correlated, so only local perception is needed, and local information is combined at higher levels to obtain global information.
Weight sharing: what is learned from one local region is applied to the other parts of the image, i.e. the same convolution kernel is slid over the whole image, and different features are obtained by using multiple different kernels.
2.6 What is the difference between CNN and a dense neural network (DNN)?
The input of a DNN is a vector, which ignores the planar structural information that is particularly important for images and NLP. For example, when recognizing a digit in an image, the digit is the same regardless of its position (in other words, the weights for any position should be the same). The input of a CNN can be a tensor, e.g. a two-dimensional matrix, and local features are extracted by filters, which better preserves the planar structure information.
2.6.1 In image classification tasks, what are the advantages of using a CNN compared to a DNN?
While both models can capture the relationships between nearby pixels, CNN has the following properties:
• Translation invariance: the exact position of a pixel is irrelevant to the filter.
• Fewer parameters, less prone to overfitting: generally, a CNN has far fewer parameters than a DNN.
• Better interpretability: we can inspect the filter weights and visualize what the network has learned.
• Hierarchical nature: complex patterns are learned by composing simpler patterns.
2.7 Feature map size calculation
The size of the feature map equals (input_size + 2 × padding_size − filter_size)/stride + 1.
    2.8 Several commonly used models. It is best to remember the approximate size parameters of the model.

2.9 Explain two methods of visualizing CNN features in image classification tasks
• Input occlusion: Occlude part of the input image to see which part has the greatest impact on classification.
For example, for a trained image classification model, take the following images as input. If we see that the probability of the third image being classified as a dog is 98%, while the accuracy of the second image is only 65%, it means that the eyes have a greater impact on the classification.
• Activation maximization: Create an artificial input image to maximize the target response (gradient ascent).
2.10 What is Group Convolution?
If the previous layer of the convolutional network has N convolution kernels, the corresponding number of channels is also N. Suppose the number of groups is M: during the convolution operation the channels are divided into M groups, each group handling N/M channels, and after each group's convolution finishes the outputs are stacked together as the output channels of the current layer. The number of parameters is reduced to 1/M of the original.
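A quick sketch of the parameter count, under the illustrative assumption of 64 input and output channels and 3×3 kernels:

```python
def conv_params(in_ch, out_ch, ksize, groups=1):
    # each group convolves in_ch/groups channels to out_ch/groups channels
    return (in_ch // groups) * (out_ch // groups) * ksize * ksize * groups

print(conv_params(64, 64, 3, groups=1))  # 36864: standard convolution
print(conv_params(64, 64, 3, groups=4))  # 9216:  1/4 of the parameters
```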
3. RNN (Recurrent Neural Network)
A Recurrent Neural Network (RNN) is a class of recursive neural network that takes sequence data as input, performs recursion along the direction of the sequence, and has all its recurrent units connected in a chain.

3.1RNN model
see here

3.2 RNN forward propagation
The three parameters W, U and V are shared across all time steps.

3.3 RNN backpropagation (BPTT) (be able to derive it by hand)
BPTT (back-propagation through time) is the common method for training RNNs. In essence it is still the BP algorithm, but since RNNs process time-series data, backpropagation has to be carried out over time, hence the name. The central idea is the same as BP: keep moving along the negative gradient direction of the parameters to be optimized until convergence. In short, the essence of BPTT is still BP, and the essence of BP is gradient descent, so computing the gradient of each parameter is the core of the algorithm.
Different from the BP algorithm, the optimization process of the two parameters W and U requires tracing back to the previous historical data. The parameter V is relatively simple and only needs to focus on the present. Then we will first solve the partial derivative of the parameter V.

Since the solution of the partial derivatives of W and U needs to involve historical data, it is relatively complicated to calculate the partial derivatives. Let us first assume that there are only three moments, then the partial derivative of L with respect to W at the third moment is:

Correspondingly, the partial derivative of L with respect to U at the third moment is:

It can be seen that the partial derivative with respect to W or U at a given time step has to trace back through all time steps before it, and this is only the derivative for a single time step. As mentioned above, the losses are also accumulated over time, so the partial derivatives of the total loss with respect to W and U are very tedious to write out. Fortunately there is still a regular pattern: based on the two formulas above, we can write the general formula for the partial derivatives of L with respect to W and U at time t:
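With the notation assumed here ($s^{(t)}$ the hidden state, $o^{(t)}$ the output, $L^{(t)}$ the loss at time $t$), the standard form of this general formula is:

$$
\frac{\partial L^{(t)}}{\partial W}=\sum_{k=0}^{t}\frac{\partial L^{(t)}}{\partial o^{(t)}}\,\frac{\partial o^{(t)}}{\partial s^{(t)}}\left(\prod_{j=k+1}^{t}\frac{\partial s^{(j)}}{\partial s^{(j-1)}}\right)\frac{\partial s^{(k)}}{\partial W},
\qquad
\frac{\partial L^{(t)}}{\partial U}=\sum_{k=0}^{t}\frac{\partial L^{(t)}}{\partial o^{(t)}}\,\frac{\partial o^{(t)}}{\partial s^{(t)}}\left(\prod_{j=k+1}^{t}\frac{\partial s^{(j)}}{\partial s^{(j-1)}}\right)\frac{\partial s^{(k)}}{\partial U}
$$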

The overall partial derivative formula is to add them up one by one in time.
As mentioned before, the activation function is nested inside. If we put the activation function in, we will take out the cumulative multiplication part in the middle:

We will find that cumulative multiplication will lead to cumulative multiplication of activation function derivatives, which will in turn lead to the occurrence of "gradient disappearance" and "gradient explosion" phenomena.
3.4 Application scenarios of RNN
Be able to name at least three, e.g. image captioning (text generation from pictures), sentiment analysis, machine translation.
• One to many
1. Image captioning: the input is image features and the output is a sentence.
2. Generating speech or music from an image: the input is image features and the output is a piece of speech or music.
• Many to one (the most typical example is sentiment analysis)
1. Input a piece of text and determine its category.
2. Input a sentence and determine its sentiment.
3. Input a video and determine its category.
• Many to many
1. Machine translation: input a text sequence in one language and output a text sequence in another language.
2. Text summarization: input a text sequence and output its summary.
3. Reading comprehension: input an article and output the answer to a question.
4. Speech recognition: input a speech sequence and output a text sequence.
3.5 The difference between CNN and RNN
The similarity is that both are neural networks; the difference is space versus time, static versus dynamic.

4. LSTM
Long Short-Term Memory (LSTM) is a recurrent neural network specifically designed to solve the long-term dependency problem of vanilla RNNs. Its core is the cell state, plus three gates that remove or add information to the cell state, i.e. choose how much information to let through.
See here

4.1 Forget gate
Function: discard information

4.2 Input gate layer
Function: determine updated information

4.3 Output gate layer
Function: Determine the value to be output

4.4 LSTM prevents the problem of gradient disappearance and gradient explosion.
When an RNN updates its parameters with BPTT, the gradient involves a long product of partial derivatives across the deep unrolled network, which causes gradient vanishing or explosion. In an LSTM, because of the gating with sigmoid and tanh, the factors that appear in this repeated product are close to either 0 or 1, so the product no longer shrinks or blows up uncontrollably, which alleviates the vanishing and exploding gradient problems.
Reference 1
Reference 2

4.5 Why LSTM uses the sigmoid and tanh activation functions
1. In an LSTM, all control gates (forget gate, input gate, output gate) use sigmoid as the activation function;
2. When computing the candidate memory or the hidden state, the hyperbolic tangent tanh is used as the activation function;
3. "Saturating" means that once the input exceeds a certain range the output hardly changes any more, which matches the definition of gating. With a non-saturating activation it would be hard to achieve the gating effect, so the sigmoid cannot be replaced by other activation functions;
4. tanh is used when computing the state mainly because its range is (−1, 1), and tanh has a larger gradient than sigmoid near 0, which usually makes the model converge faster.
5. GRU
5.1 Principle
Compared with LSTM, using GRU can achieve comparable results, and it is easier to train and can greatly improve training efficiency. Therefore, GRU is often preferred. (In a word, it is simple but the effect is not bad)
It combines the forget gate and the input gate into a single update gate, mixes the cell state and the hidden state, and makes some other changes. The final model is simpler than the standard LSTM and is a very popular variant.

5.2. The difference between GRU and LSTM.
The performance of GRU and LSTM are comparable on many tasks.
GRU has fewer parameters and is therefore easier to converge, but when the data set is large, LSTM has better expression performance.
Structurally, GRU has only two gates (update and reset) while LSTM has three (forget, input, output); GRU passes the hidden state directly to the next unit, whereas LSTM wraps the hidden state in a memory cell.
6.GAN (Generative Adversarial Network)
6.1 The idea behind GAN
A GAN uses a generative model and a discriminative model (for interview questions about generative vs. discriminative models, see here). The discriminative model judges whether a given picture is real; the generative model produces a picture that looks as similar as possible to the real ones. At the beginning neither model is trained; then the two models are trained together adversarially: the generator produces pictures to fool the discriminator, and the discriminator distinguishes real from fake. During training both models become stronger and stronger and finally reach a steady state.
7. EM algorithm
The EM algorithm is used for maximum-likelihood or maximum a posteriori estimation of models with latent (hidden) variables. It consists of two steps: the E step, computing the expectation, and the M step, maximization. In essence, EM is an iterative algorithm that repeatedly uses the previous parameter estimates to compute the expected values of the latent variables and then re-estimates the parameters, until convergence.
Note: the EM algorithm is sensitive to the initial values, and EM maximizes the log-likelihood by repeatedly maximizing a lower bound, so it is not guaranteed to find the global optimum. You should also be able to derive EM.
8.HMM (hidden Markov model)
will not be written for now
9.CRF
will not be written for now
10. Deep learning parameter update method

11. How to solve the imbalance problem of sample categories?
• Oversampling/upsampling: Increase the number of samples with fewer categories to achieve a balanced sample number. Specifically, multiple pieces of data are formed by copying samples in categories. The disadvantage of this method is that overfitting is prone to occur when the sample has few features. The oversampling method needs to be improved. The improved method is to add noise and interference data to samples with few categories or generate newly synthesized samples through certain rules, such as the smote algorithm.
• Undersampling/downsampling: Reduce the number of samples with many categories. The general method is to randomly remove some samples with many categories. The shortcomings of downsampling are also obvious, that is, the final training set loses data, and the model only learns part of it, which may lead to underfitting. The solution: multiple downsampling (sampling with replacement, so that the training set generated is only Independent of each other) generate multiple different training sets, then train multiple different classifiers, use model fusion, and obtain the final result by combining the results of multiple classifiers.
• The loss function uses different class weights: give high weights to samples from the rare classes and low weights to samples from the common classes (a small sketch follows after this list).
• Choose a loss function other than accuracy: In the case of unbalanced samples, the high accuracy obtained is meaningless, and the accuracy will be invalid. For the problem of data imbalance in machine learning, it is recommended to use more PR (Precision-Recall curve) instead of ROC curve.
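A small sketch of the class-weight idea for a binary cross-entropy loss (the weight values below are illustrative, e.g. roughly inverse to class frequency):

```python
import numpy as np

def weighted_bce(y_true, y_pred, pos_weight=5.0, neg_weight=1.0, eps=1e-7):
    # give the rare (positive) class a higher weight than the common (negative) class
    y_pred = np.clip(y_pred, eps, 1 - eps)
    loss = -(pos_weight * y_true * np.log(y_pred)
             + neg_weight * (1 - y_true) * np.log(1 - y_pred))
    return loss.mean()

y_true = np.array([1, 0, 0, 0, 0])
y_pred = np.array([0.3, 0.2, 0.1, 0.4, 0.2])
print(weighted_bce(y_true, y_pred))
```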
● The role of BatchNormalization
Reference answer:
During neural network training, as the number of layers increases, the overall distribution of the activation functions' input values gradually approaches the upper and lower ends of the activation function's range, which causes the gradients of the lower (shallower) layers to vanish during backpropagation. Batch Normalization pulls this increasingly skewed distribution back to a standardized distribution, so that the activation inputs fall in the region where the activation function is sensitive to its input; the gradients therefore become larger, learning converges faster, and the vanishing-gradient problem is avoided.
● Vanishing gradients
Reference answer:
In a neural network, the earlier (front) hidden layers learn more slowly, i.e. their gradients are smaller, than the later hidden layers; as the number of hidden layers increases, classification accuracy can even drop. This phenomenon is called the vanishing gradient problem.
● Recurrent neural network, why is it good?
Reference answer:
A recurrent neural network (RNN) is an artificial neural network in which nodes are connected in a directed ring; it is a feedback network. An RNN uses its internal memory to process arbitrary sequences of inputs, and there are both internal feedback connections and feedforward connections between its processing units, which makes it easier for an RNN to handle unsegmented text.
● What is Group Convolution?
Reference answer:
If the upper layer of the convolution network has N convolution kernels, the corresponding number of channels is also N. Suppose the number of groups is M. During the convolution operation, the channels are divided into M parts. Each group corresponds to N/M channels. Then, after the convolution of each group is completed, the outputs are stacked together as the output channels of the current layer.
● What is an RNN?
Reference answer:
The current output of a sequence also depends on the previous outputs. In an RNN, the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step; the network memorizes previous information and applies it to the computation of the current output.
● During training, if a model does not converge, does that mean the model is useless? What are the possible reasons for non-convergence?
Reference answer:
It does not mean the model is useless. Possible reasons for non-convergence: the data labels are inaccurate; the samples contain too much information and the model cannot fit the whole sample space; the learning rate is set too high (causing oscillation) or too low (causing non-convergence); a simple model is used for a complex classification task; the data have not been normalized.
● Sharpening and smoothing operations in image processing
. Reference answer:
Sharpening is to reduce the blur in the image by enhancing high-frequency components. It enhances the edges of the image and also increases the noise of the image.
Smoothing is the opposite of sharpening: it filters out high-frequency components, reducing image noise and blurring the image.
● What are the advantages of VGG using 3×3 convolution kernels?
Reference answer:
Two 3×3 convolution kernels in series have the same receptive field as one 5×5 kernel, with fewer parameters. Multiple 3×3 kernels also provide more layers of nonlinear functions than a single larger kernel, which increases the nonlinear expressiveness and makes the decision function more discriminative.
● How is ReLU better than sigmoid?
Reference answer:
The derivative of sigmoid gives useful activation only near 0, while in the positive and negative saturated regions the gradient tends to 0, causing gradient dispersion; ReLU's gradient is constant in the region greater than 0, so gradient dispersion does not occur. ReLU's derivative is also faster to compute. In the negative half, ReLU's derivative is 0, so a neuron with negative activation has zero gradient and does not participate in training, which gives sparsity.
● Question: What is weight sharing in neural networks?
Reference answer: convolutional neural networks, recurrent neural networks.
Analysis: explain it directly through the network structure.
● Question: Neural network activation functions?
Reference answer: sigmoid, tanh, relu.
Analysis: you need to know the function graphs, characteristics, comparisons between them, advantages and disadvantages, and improvement methods.
● Question: In deep learning we usually fine-tune an existing mature model, modifying only the weights of the last few layers on the new data. Why?
Reference answer:
The quality of data sets in practice varies, and trained networks can be used to extract features. Treat the trained network as a feature extractor.
● Question: Draw the GRU structure diagram
Reference answer:

GRU has two gates: an update gate and a reset gate.
Analysis: if you cannot draw a GRU, draw an LSTM or an RNN instead, or explain the connections and differences between GRU and the other two networks. Do not simply say you can't.
● The role of the Attention mechanism.
Reference answer:
It reduces the computational burden of processing high-dimensional input data by selecting a structured subset of the input, thereby reducing the dimensionality of the data. It makes it easier for the system to find the information in the input that is useful for the current output, improving output quality. It also helps frameworks such as encoder-decoder models better learn the relationships between multiple modalities of content.
● The principles of Lstm and Gru
Reference answer:
Lstm consists of an input gate, a forgetting gate, an output gate and a cell. The first step is to decide what information to discard from the cell state, then decide how much new information will enter the cell state, and finally decide what information to output based on the current cell state.
A GRU consists of a reset gate and an update gate. Its inputs are the hidden state of the previous time step and the current input, and its output is the hidden state for the next time step. The reset gate is used when computing the candidate hidden state; it controls how much of the previous hidden state is kept. The update gate controls how much of the candidate hidden state is mixed in to obtain the current hidden state.
● What is dropout?
Reference answer:
During neural network training, neurons are randomly dropped from the network with a certain probability, so that each mini-batch effectively trains a different network, which prevents overfitting.
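One common (inverted-dropout) sketch of this at training time; the keep probability and the 1/keep_prob rescaling are standard choices rather than something specified in the text:

```python
import numpy as np

def dropout_forward(x, keep_prob=0.8, training=True):
    if not training:
        return x  # at test time all units are kept
    mask = (np.random.rand(*x.shape) < keep_prob) / keep_prob  # drop and rescale
    return x * mask

h = np.random.randn(4, 8)
h_dropped = dropout_forward(h, keep_prob=0.8)
```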
● Calculation formulas of the LSTM gates
Reference answer:
Forget gate:
Input gate:
Output gate:
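In the standard notation (with $x_t$ the current input, $h_{t-1}$ the previous hidden state, $\sigma$ the sigmoid, and $\odot$ element-wise multiplication), these gates are usually written as:

$$
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) && \text{(output gate)}\\
\tilde{C}_t &= \tanh(W_C\,[h_{t-1}, x_t] + b_C), \quad
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \quad
h_t = o_t \odot \tanh(C_t)
\end{aligned}
$$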
● Principle of DropConnect
Reference answer:
A method to prevent overfitting. Different from dropout, it does not clear the node output of the hidden layer to 0 according to probability, but clears the input weight connected to each node to 0 with a certain probability.

● What is the role of the BN layer? Why do we need to add gamma and beta at the end? Is it okay if we don’t add them?
Reference answer:
The BN layer pulls all the data in a batch from an irregular distribution towards a standard normal distribution. The advantage is that the data fall in the sensitive region of the activation function, i.e. the region with larger gradients, so errors propagate back faster during backpropagation. The γ (scale) and β (shift) parameters are added at the end because forcing every input strictly to a mean-0, variance-1 distribution would reduce the network's expressive power; γ and β are learned during training and let the network recover a suitable scale and shift, so they should not be omitted.
● The problem of gradient disappearance and gradient explosion.
Reference answer:
Because of the activation function, the factors in the chain of derivatives can become very small, so the error cannot be effectively backpropagated, causing the vanishing gradient problem; conversely, when the factors are large (e.g. the weights are too large), the gradients grow exponentially, causing the exploding gradient problem.
● Adam
reference answer:
The Adam algorithm is different from traditional stochastic gradient descent. Stochastic gradient descent maintains a single learning rate (i.e. alpha) to update all weights, and the learning rate does not change during the training process. Adam designs independent adaptive learning rates for different parameters by calculating the first-order moment estimate and the second-order moment estimate of the gradient.
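A compact NumPy sketch of this update for a single parameter vector (the hyperparameter values are the commonly quoted defaults, taken here as assumptions):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return w, m, v

w = np.zeros(3); m = np.zeros(3); v = np.zeros(3)
w, m, v = adam_step(w, grad=np.array([0.1, -0.2, 0.3]), m=m, v=v, t=1)
```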
●Attention mechanism
Reference answer:
Attention can simply be understood as weight assignment. Taking the attention formula in seq2seq as an example: each input word is assigned a weight, computed by comparing it with the decoder's hidden state at the current time step. A larger weight means the word is more important. Finally a weighted sum is taken.

● RNN gradient disappearance problem, why LSTM and GRU can solve this problem.
Reference answer:
Because an RNN unrolled over time is very deep, the output error of the later time steps has almost no effect on the computations of the earlier ones, and a given RNN unit is mainly influenced by the units near it. An LSTM can memorize long-term information through its gates and accordingly preserves more of the gradient; a GRU likewise keeps long-term memory through its reset and update gates, which also largely mitigates the vanishing-gradient problem.
● The idea behind GAN
Reference answer:
A GAN uses a generative model and a discriminative model. The discriminative model judges whether a given picture is real; the generative model produces a picture that is as similar as possible to the real ones. At the beginning neither model is trained; then the two models are trained adversarially together: the generator produces pictures to fool the discriminator, and the discriminator distinguishes real from fake. During training both models become stronger and stronger and finally reach a steady state.
● 1×1 convolution
Reference answer:
It realizes cross-channel interaction and information integration, can increase or decrease the number of channels (dimensionality), can form linear combinations of multiple feature maps, and can achieve an effect equivalent to a fully connected layer.
● How to improve the generalization ability of the network.
Reference answer:
Improve performance from data: collect more data, scale and transform the data, combine features and redefine the problem.
Improve performance through algorithm tuning: use reliable model diagnostic tools to diagnose the model; initialize the weights with small random numbers; adjust the learning rate; choose an appropriate activation function; adjust the network topology; adjust the batch size and number of epochs; add regularization; try other optimization methods; use early stopping.
● What is seq2seq model
reference answer:
Seq2seq is a type of encoder-decoder structure that uses two RNNs, one as an encoder and one as a decoder. The encoder is responsible for compressing the input sequence into a vector of specified length. This vector can be regarded as the semantics of this sequence, and the decoder is responsible for generating the specified sequence based on the semantic vector.
● The role of the activation function
. Reference answer:
The activation function is used to add nonlinear factors to improve the expression ability of the neural network to the model and solve problems that cannot be solved by the linear model.
● Why use relu instead of sigmoid?
Reference answer:
The derivative of sigmoid only has better activation when it is near 0. The gradient in the positive and negative saturation areas is close to 0, which will cause gradient dispersion. The gradient of the relu function is constant in the part greater than 0, and the gradient dispersion phenomenon will not occur. The derivative of the Relu function in the negative half area is 0, which means that this neuron will not undergo training, which is the so-called sparsity. And the derivative of the relu function is calculated faster.
● Tell us about the speech recognition process of static decoding network based on WFST?
Reference answer:
Starting from the speech features, I talked in detail about the principles and extraction process of MFCC and LPC, then the Viterbi decoding process, and finally gave an overview of how HCLG.fst is constructed.
● Do you understand target detection? What is the difference between Faster R-CNN and R-CNN?
Reference answer:
Target detection, also called target extraction, is image segmentation based on the geometric and statistical characteristics of the target. It combines segmentation and recognition of the target into one, and its accuracy and real-time performance are important capabilities of the whole system. Especially in complex scenes where multiple targets must be processed in real time, automatic target extraction and recognition is particularly important.
With the development of computer technology and the wide application of computer vision, using image processing to track targets in real time has become more and more popular. Dynamic real-time tracking and positioning of targets has wide application value in intelligent transportation systems, intelligent monitoring systems, military target detection, and the positioning of surgical instruments in medical navigation surgery.
Method / steps, disadvantages, and improvements:

R-CNN
Steps: 1. Selective Search (SS) extracts region proposals (RP); 2. CNN extracts features; 3. SVM classification; 4. Bounding-box regression.
Disadvantages: 1. The training steps are cumbersome (fine-tune the network + train the SVMs + train the bbox regressor); 2. Both training and testing are slow; 3. Training takes up a lot of storage space.
Improvements: 1. mAP rose directly from 34.3% (DPM HSC) to 66%; 2. Introduced region proposals + CNN.

Faster R-CNN
Steps: 1. RPN extracts region proposals; 2. CNN extracts features; 3. Softmax classification; 4. Multi-task loss with bounding-box regression.
Disadvantages: 1. Still cannot reach real-time detection; 2. Extracting region proposals and then classifying each proposal is still relatively computationally intensive.
Improvements: 1. Improves detection accuracy and speed; 2. Realizes an end-to-end object detection framework; 3. Generating proposal boxes takes only about 10 ms.
● Do you understand SPP and YOLO?
Reference answer:
Introduction to SPP-Net:
The main improvements of SPP-Net are the following two:
1). Shared convolution calculation, 2). Spatial pyramid pooling.
SPP-Net still consists of the same parts: the region proposals from the SS algorithm, a CNN, SVM classifiers, and bounding-box regression.
The SS proposals are still generated on the original image, but the features are extracted on the Conv5 feature map; because the size changes, a scale transformation is needed when mapping the proposals onto Conv5. This is the biggest difference between R-CNN and SPP-Net, and the reason SPP-Net is so much faster: it makes full use of the convolution computation, i.e. each picture is convolved only once. However, this improvement brings a new problem: since the proposals produced by the SS algorithm have inconsistent scales, the features extracted on Conv5 also have inconsistent sizes, so the fixed-size fully connected layers (as in AlexNet) cannot be applied directly.
Therefore SPP-Net needs an operation that can produce a uniform output from inconsistent inputs. That operation is SPP, spatial pyramid pooling, which replaces the pooling layer in R-CNN; apart from this, it is the same as R-CNN.
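A rough sketch of spatial pyramid pooling, assuming the commonly used {1×1, 2×2, 4×4} pyramid (not stated in the original text): whatever the input feature-map size, the output length is fixed at 1 + 4 + 16 bins per channel.

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feat, levels=(1, 2, 4)):
    """feat: (batch, channels, H, W) -> (batch, channels * sum(l*l for l in levels))"""
    batch, channels = feat.shape[:2]
    pooled = []
    for l in levels:
        # adaptive_max_pool2d produces an l x l grid no matter what H and W are
        p = F.adaptive_max_pool2d(feat, output_size=(l, l))
        pooled.append(p.view(batch, channels * l * l))
    return torch.cat(pooled, dim=1)

a = spatial_pyramid_pool(torch.randn(1, 256, 13, 13))
b = spatial_pyramid_pool(torch.randn(1, 256, 24, 17))
print(a.shape, b.shape)   # both torch.Size([1, 5376]) -> fixed-length vector
```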
Detailed explanation of YOLO:
The name YOLO, You Only Look Once, is a high-level summary of its own characteristics. The core idea of YOLO is to treat object detection as a regression problem. YOLO first divides the image into an S×S grid. Note that this grid is different from the approach mentioned above of dividing the image into N regions and throwing each into a detector: there the image is really cropped, i.e. certain local pixels are fed to the detector, whereas here the division is only logical.
● How to solve gradient disappearance and gradient explosion.
Reference answer:
1) Use activation functions such as ReLU, LReLU, ELU, maxout, etc.
The gradient of the sigmoid function disappears as x increases or decreases, but ReLU does not.
2) Use batch normalization
Through the normalization operation, the output signal x is normalized to zero mean and unit variance, which keeps the network stable. From the backpropagation formula we can see that w appears in it, so the magnitude of w affects whether gradients vanish or explode. Batch Normalization standardizes the output of each layer to the same mean and variance, which removes the amplification or attenuation effect caused by w and thereby alleviates gradient vanishing and explosion.
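A minimal illustration of the normalization described in 2): per-feature standardization to zero mean and unit variance, followed by the learnable scale and shift. This is a numpy sketch using training-time batch statistics only (running averages for inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Standardize each feature over the batch,
    then rescale with learnable gamma and shift with learnable beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 8) * 5.0 + 3.0               # badly scaled activations
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1 per feature
```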
● RNN is prone to gradient disappearance, how to solve it?
Reference answers:
1). Clipping Gradient.
Since the gradient can vanish during backpropagation (the partial derivative gets infinitely close to 0, so long-term memory cannot be updated), the simplest and crudest method is to set a threshold: when the gradient falls below the threshold, the update uses the threshold value instead (a minimal clipping sketch appears after this list).
Advantages: simple and crude.
Disadvantages: it is difficult to find a satisfactory threshold.
2). LSTM (Long Short-Term Memory)
LSTM imitates long-term memory to a certain extent. Compared with gradient clipping, its biggest advantage is that it learns automatically: through error backpropagation it learns to control what needs to be stored as memory in the LSTM cell. A general long-term memory model includes three processes, writing, reading and forgetting, which correspond to the input gate, output gate and forget gate in LSTM. Their values lie between 0 and 1, which is equivalent to applying learned weights to the input and output; a large amount of data is used to automatically learn these weighting parameters (that is, to learn via BP which errors can be used to update the parameters). The specific formulas are:
f_t = σ(W_f·[h_{t−1}, x_t] + b_f),  i_t = σ(W_i·[h_{t−1}, x_t] + b_i),  o_t = σ(W_o·[h_{t−1}, x_t] + b_o)
c̃_t = tanh(W_c·[h_{t−1}, x_t] + b_c),  c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t,  h_t = o_t ⊙ tanh(c_t)
Advantages: The model automatically learns and updates parameters.
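Going back to 1), a minimal clipping sketch using PyTorch's built-in norm clipping. The model and loss here are placeholders; note that the built-in utility caps the gradient norm from above (the common use for exploding gradients), while flooring the gradient at a threshold, as described above for vanishing gradients, would be a manual variant:

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)  # placeholder RNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 20, 16)                 # (batch, time, features)
out, _ = model(x)
loss = out.pow(2).mean()                   # dummy loss, just to get gradients

optimizer.zero_grad()
loss.backward()
# Rescale gradients so that their global norm does not exceed the threshold 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```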
● What is the difference between LSTM and RNN?
Reference answer:
Comparison between LSTM and RNN.
RNN has flaws when dealing with long term memory, so LSTM came into being. LSTM is a variant of RNN. Its essence lies in the introduction of the concept of cell state. Unlike RNN, which only considers the most recent state, the cell state of LSTM determines which states should be kept and which states should be forgotten.
Let’s take a look at some of the differences in the internal structures of RNN and LSTM:
RNN

LSTM

As can be observed from the two figures above, the LSTM structure is more complex. In RNN, the past output and the current input are concatenated together, and tanh controls the combined output; it only considers the state of the most recent moment. In RNN there are two inputs and one output.
In order to remember long-term state, LSTM adds one more input and one more output to the RNN. The added path is the cell state, the top path in the figure. The entire LSTM can be divided into three parts:
1) Which cell states should be forgotten
2) Which new states should be added
3) Based on the current state and current input, what should the output be?
Let’s discuss them separately:
1) Which cell states should be forgotten.
This function is implemented through the sigmoid function, the leftmost channel: based on the input and the output of the previous moment, it determines which parts of the current cell state need to be forgotten. For example, if the previous cell state contained a subject and a new subject appears in the input, the old subject should be forgotten. After the concatenation of the input and the previous moment's output passes through the sigmoid function, values closer to 0 mean forgetting more, and values closer to 1 mean forgetting less.
2) Which new states should be added?
Continuing the above example, the new subject is naturally the content that should be added to the cell state. Again, a sigmoid function decides what should be remembered. It is worth mentioning, however, that the content to be remembered is not the direct concatenation of the input and the previous moment's output; it also has to pass through tanh, which is consistent with RNN. Also note that this sigmoid has different w and b from the sigmoid layer in the previous step; they are separately trained layers.
After the cell state forgets what should be forgotten and remembers what should be remembered, it can be used as the input of the cell state at the next moment.
3) Based on the current state and current input, what should the output be?
This is the rightmost path. It also uses the sigmoid function as a gate and performs tanh filtering on the state obtained in the second step to obtain the final prediction result.
In fact, LSTM is based on RNN and adds filtering of past states, so that it can select which states have more influence on the current state instead of simply selecting the most recent state.
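A minimal numpy sketch of one LSTM step following the three parts above, assuming the standard gate equations; the weights here are random placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W: dict of weight matrices over [h_prev, x_t], b: biases."""
    z = np.concatenate([h_prev, x_t])       # concatenate past output and current input
    f = sigmoid(W["f"] @ z + b["f"])        # 1) which cell states to forget
    i = sigmoid(W["i"] @ z + b["i"])        # 2) how much new content to write ...
    g = np.tanh(W["g"] @ z + b["g"])        #    ... and the candidate content itself
    c = f * c_prev + i * g                  # updated cell state
    o = sigmoid(W["o"] @ z + b["o"])        # 3) what to output given state and input
    h = o * np.tanh(c)
    return h, c

hidden, inputs = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hidden, hidden + inputs)) for k in "figo"}
b = {k: np.zeros(hidden) for k in "figo"}
h, c = lstm_step(rng.normal(size=inputs), np.zeros(hidden), np.zeros(hidden), W, b)
print(h.shape, c.shape)   # (4,) (4,)
```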
After that, researchers implemented various LSTM variant networks. What remains unchanged is that the sigmoid function is usually used as a gate to filter states or inputs. And the output must go through the tanh function. Specifically why these two functions are used, I can't give a certain explanation because I am new to it. I will add more in the future when I understand it.
● What is the difference between a convolutional layer and a pooling layer?
Reference answer:
The convolutional layer extracts features; the pooling layer compresses the feature map.
Convolution: for three-dimensional data such as an RGB image (3 channels), the depth of the convolution kernel must equal the number of input channels, and the number of output channels equals the number of convolution kernels; the convolution operation therefore changes the number of channels of the input feature map.
Pooling: it operates only on the 2D spatial dimensions of each channel, so it does not change the number of input channels; for multi-channel input this is a big difference from convolution.
Feature weight sharing: reduces the number of parameters and takes advantage of the position independence of image targets.
Sparse connection: Each value of the output depends only on part of the input value.
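A quick check of the channel behaviour and spatial compression described in this answer; the channel counts are example values:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)             # RGB input: 3 channels

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2)

print(conv(x).shape)   # torch.Size([1, 16, 32, 32]) -> conv changes the channel count
print(pool(x).shape)   # torch.Size([1, 3, 16, 16])  -> pooling shrinks H, W only
```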
● What are the methods to prevent overfitting?
Reference answers:
1) Dropout; 2) Adding L1/L2 regularization; 3) BatchNormalization; 4) Network bagging
● Tell me about dropout.
Reference answers:
The goal of Dropout is to approximate bagging over an exponentially large ensemble of neural networks. Dropout training is not the same as bagging training: in bagging, all models are independent.
In the case of Dropout, the models are parameter-shared, where each model inherits a different subset of the parameters of the parent neural network. Parameter sharing makes it possible to represent an exponential number of models with limited available memory. In the case of bagging, each model is trained on its corresponding training set until convergence.
In the case of dropout, usually most of the model is not explicitly trained, and often the model is so large that it would not be possible to sample all possible subnetworks until the universe is destroyed. Instead, a small subset of possible subnetworks is trained in a single step, with parameter sharing resulting in good parameter settings for the remaining subnetworks.
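A minimal usage sketch (the network and dropout rate are illustrative); note the train/eval switch, since dropout only randomly zeroes activations during training:

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Dropout(p=0.5),          # each hidden unit is dropped with probability 0.5
    nn.Linear(64, 2),
)

x = torch.randn(8, 20)
net.train()                     # dropout active: a different "subnetwork" every pass
train_out = net(x)
net.eval()                      # dropout disabled: the full network is used
eval_out = net(x)
print(train_out.shape, eval_out.shape)
```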
● ReLU
Reference answer:
In deep neural networks, a function called the Rectified Linear Unit (ReLU) is usually used as the activation function of neurons. ReLU originates from neuroscience research: in 2001, Dayan and Abbott, working from a biological perspective, proposed a more accurate activation model for how brain neurons receive signals, as shown below:

The horizontal axis is time (ms), and the vertical axis is the neuron's firing rate. In the same year, Attwell and other neuroscientists studied the brain's energy consumption and inferred that neurons work in a sparse, distributed way; in 2003, Lennie and others estimated that only 1–4% of the brain's neurons are activated at any one time, which further indicates the sparsity of neural activity. For the ReLU function, how does similar behaviour show up? What are its advantages over other linear functions (such as purelin) and nonlinear functions (such as sigmoid and the hyperbolic tangent)? Let me explain step by step.
First, let’s take a look at the form of the ReLU activation function, as shown below:

It is not hard to see from the figure above that ReLU is actually a piecewise linear function: it sets all negative values to 0 while leaving positive values unchanged. This operation is called one-sided (unilateral) inhibition. Do not underestimate this simple operation: it is precisely because of this one-sided inhibition that neurons in the network acquire sparse activation. In particular, in deep models such as CNNs, when N layers are stacked, the activation rate of ReLU neurons in theory drops by a factor of 2 per layer, i.e. by 2^N overall. Some readers may ask: why must the ReLU graph look like this; could it be mirrored or extended downward? In fact it does not have to look exactly like this. As long as it plays a one-sided inhibitory role, a mirror flip or a 180-degree rotation only amounts to multiplying the neuron's final output by a constant factor, which does not affect the training result. The shape was probably fixed this way to match the biological picture and make it easier to understand.
So the question is: what is this sparsity good for? In other words, why do we want neurons to be sparse? Let me give an example. When watching Detective Conan, we think and reason about the plot, which uses the left hemisphere of the brain; when watching The Masked Singer, we hum along with the singer, which uses the right hemisphere. The left hemisphere focuses on rational thinking and the right on emotion. In other words, when we are reasoning or appreciating, some neurons are activated while others are inhibited; they each perform their own duties. For another example, when you see a doctor, the examination report contains hundreds of indicators, but usually only a few are related to your condition. Likewise, when training a deep classification model, often only a few features are truly related to the target. A sparse model obtained through ReLU can therefore better mine the relevant features and fit the training data.
In addition, compared with other activation functions, ReLU has the following advantages: compared with linear functions, ReLU has stronger expressive power, especially in deep networks; compared with nonlinear functions such as sigmoid, the gradient of ReLU over the positive interval is constant, so there is no vanishing gradient problem, which keeps the model's convergence speed stable. Briefly, the vanishing gradient problem is this: when the gradient is less than 1, the error between the predicted value and the true value is attenuated every time it is propagated back through a layer; this is especially obvious when sigmoid is used as the activation function in a deep model, and it can make convergence stall.
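A tiny numerical illustration of the one-sided suppression and the resulting sparsity discussed above (the input distribution is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
pre_activation = rng.normal(size=(1000,))      # roughly half negative, half positive

relu_out = np.maximum(pre_activation, 0.0)     # negative values suppressed to 0
sparsity = np.mean(relu_out == 0.0)

print(f"fraction of inactive (zero) units: {sparsity:.2f}")   # ~0.5 for this input
```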
● Why is CNN better than DNN in image recognition?
Reference answer:
The input to a DNN is a vector, which ignores the planar structural information of the data; this structure is particularly important for images and NLP. For example, when recognizing a digit in an image, the same digit should be recognized regardless of its position (in other words, the weights for every position should be the same). The input to a CNN can be a tensor, for example a two-dimensional matrix, and local features are extracted through filters, which better preserves the planar structural information.
● Is there a correlation between the weights of the CNN models used?
Reference answer:
There is a correlation between weights. CNN is weight sharing, which reduces the number of parameters.
To put it simply, one convolution kernel is convolved with the image; note that it is the same kernel and its values do not change during the convolution, which is what reduces the number of weight parameters. "Sharing" means the whole picture shares the same kernel. For a 100×100-pixel image, a single neuron densely connected to the image would need 100×100 = 10,000 weights. If we instead use a 10×10 convolution kernel, although it has to be applied many times, it needs only 10×10 = 100 parameters; adding a bias b, only 101 parameters are needed in total, and the resulting feature map is still about 100×100. Even for a larger image the kernel still needs only 101 parameters, whereas the dense neuron would need even more. Sliding the 10×10 kernel over the image to extract features gives us one Feature Map.
One convolution kernel can only extract one kind of feature, so we need more kernels. Suppose we use 6 kernels; we then get 6 Feature Maps, which together form one layer of neurons. These 6 Feature Maps need 101×6 = 606 parameters, which is still small compared with 10,000. If instead we connect layers the fully connected way, a 28×28 = 784-unit input layer plus a first hidden layer of 30 neurons needs 784×30 weights plus 30 biases, 23,550 parameters in total, roughly 40 times more (a small calculation follows below).
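The parameter counts above, reproduced as a small calculation under the same assumptions (100×100 image, 10×10 kernel, 6 kernels, versus a 28×28 input fully connected to 30 hidden units):

```python
# Convolutional layer: one 10x10 kernel shared across the whole 100x100 image.
conv_single = 10 * 10 + 1              # 100 weights + 1 bias = 101 parameters
conv_six = 6 * (10 * 10 + 1)           # 6 kernels -> 606 parameters

# Fully connected layer: 28x28 = 784 inputs to 30 hidden neurons.
fc = 784 * 30 + 30                     # weights + biases = 23,550 parameters

print(conv_six, fc, round(fc / conv_six))   # 606 23550 39 -> roughly 40x more
```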
● Why does neural network use cross entropy?
Reference answer:
When solving multi-classification problems with neural networks, the most common approach is to put n output nodes in the last layer; this is true for both shallow networks and CNNs. For example, the output layer of AlexNet has 1,000 nodes, and even though ResNet removes the intermediate fully connected layers, it still ends with an output layer of 1,000 nodes.
Generally, the number of nodes in the final output layer equals the number of classes in the classification task. Suppose the number of nodes is N; then for each example the network outputs an N-dimensional array, where each dimension corresponds to one class. In the ideal case, if a sample belongs to class k, the output of the node corresponding to that class should be 1 and the outputs of all other nodes should be 0, i.e. [0,0,1,0,…,0,0]; this array is the label of the sample and the most desired output of the network. Cross entropy is used to measure how close the actual output is to this desired output.
H(p, q) = −Σ_x p(x) · log q(x), where p is the expected (label) distribution and q is the network's actual output distribution.
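A small numeric example of this formula with a softmax output and a one-hot label (the values are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])         # raw network outputs for 3 classes
q = softmax(logits)                         # predicted distribution
p = np.array([1.0, 0.0, 0.0])               # one-hot label: sample belongs to class 0

cross_entropy = -np.sum(p * np.log(q))      # H(p, q) = -sum_x p(x) log q(x)
print(q.round(3), round(cross_entropy, 4))  # smaller value = output closer to label
```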
● The difference between LSTM and naive RNN
Reference answer:
See the answer to "What is the difference between LSTM and RNN?" above; the internal-structure comparison and the three steps (which cell states to forget, which new states to add, and what to output) are the same.
● Gradient update methods
Reference answer:
1) Batch gradient descent (BGD)
Batch Gradient Descent is the most primitive form of gradient descent: the gradient of the loss (energy) function is computed as the mean over the entire training set, so every parameter update requires a pass over all samples. Taking the partial derivative of the loss with respect to each parameter θ_j and averaging it over the m samples gives the update

θ_j := θ_j − α · (1/m) · Σ_{i=1..m} ∂L(θ; x⁽ⁱ⁾, y⁽ⁱ⁾) / ∂θ_j
Advantages: Global optimal solution; easy to implement in parallel;

Disadvantages: When the number of samples is large, the training process will be very slow.

2) Stochastic gradient descent method SGD

Since the batch gradient descent method requires all training samples when updating each parameter, the training process will become extremely slow as the number of samples increases. Stochastic Gradient Descent (SGD for short) was proposed to solve the shortcomings of the batch gradient descent method.

SGD instead treats the energy function as a sum of per-sample terms and updates the parameters with the gradient of a single randomly chosen sample at each step:

θ_j := θ_j − α · ∂L(θ; x⁽ⁱ⁾, y⁽ⁱ⁾) / ∂θ_j
Advantages: fast training; can jump out of local minima

Disadvantages: Decreased accuracy, not global optimal; not easy to implement in parallel.

3) Mini-batch gradient descent method MBGD

From the two gradient descent methods above we can see that each has its own advantages and disadvantages. Can a compromise be reached between them, i.e. a training process that is reasonably fast while still guaranteeing the accuracy of the final parameters? That is the motivation of Mini-batch Gradient Descent (MBGD).

Samples are taken one mini-batch at a time and the gradient is computed on the mini-batch. Compared with sampling only one example at a time, this makes the computed gradient closer to the expected gradient and reduces the interference of abnormal samples.

4) AdaGrad
The idea of AdaGrad is to adapt the learning rate of each model parameter independently: each parameter's learning rate is scaled inversely proportional to the square root of the sum of its historical squared gradients, so parameters with large partial derivatives have their learning rate shrunk quickly, while parameters with small partial derivatives have theirs shrunk only slightly.
The learning rate is monotonically decreasing; if it becomes too small late in training, training becomes difficult or even stops early.
A global initial learning rate still needs to be set.

5) RMSProp
RMSProp mainly solves the problem of the learning rate decaying too aggressively in AdaGrad: AdaGrad shrinks the learning rate according to the entire history of squared gradients, which can make the learning rate too small before a (local) minimum is reached, so training can hardly continue.
RMSProp uses an exponentially decaying (recursively defined) average to discard the distant history, which lets it converge quickly after finding a "convex" bowl-like structure. It also adds a hyperparameter ρ that controls the decay rate.

6) Adam
Adam goes one step further than RMSProp: in addition to the exponentially decaying average of the squared historical gradients (r), it also keeps an exponentially decaying average of the historical gradients themselves (s), which is equivalent to momentum.
Adam behaves like a ball with friction, tending to settle in flat minima on the error surface.
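A compact numpy sketch of the update rules summarized above (SGD, AdaGrad, RMSProp, Adam), shown for a single scalar parameter; the hyperparameter values are common defaults, not from the original text:

```python
import numpy as np

def sgd(theta, grad, lr=0.01):
    return theta - lr * grad

def adagrad(theta, grad, state, lr=0.01, eps=1e-8):
    state["r"] = state.get("r", 0.0) + grad ** 2                      # all squared grads
    return theta - lr * grad / (np.sqrt(state["r"]) + eps)

def rmsprop(theta, grad, state, lr=0.001, rho=0.9, eps=1e-8):
    state["r"] = rho * state.get("r", 0.0) + (1 - rho) * grad ** 2    # decayed history
    return theta - lr * grad / (np.sqrt(state["r"]) + eps)

def adam(theta, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    state["t"] = state.get("t", 0) + 1
    state["s"] = beta1 * state.get("s", 0.0) + (1 - beta1) * grad       # momentum term
    state["r"] = beta2 * state.get("r", 0.0) + (1 - beta2) * grad ** 2  # RMSProp term
    s_hat = state["s"] / (1 - beta1 ** state["t"])                      # bias correction
    r_hat = state["r"] / (1 - beta2 ** state["t"])
    return theta - lr * s_hat / (np.sqrt(r_hat) + eps)

# One update step on f(theta) = theta^2, whose gradient is 2 * theta.
theta0 = np.array(5.0)
grad0 = 2 * theta0
print("sgd     :", float(sgd(theta0, grad0)))
print("adagrad :", float(adagrad(theta0, grad0, {})))
print("rmsprop :", float(rmsprop(theta0, grad0, {})))
print("adam    :", float(adam(theta0, grad0, {})))
```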

Origin blog.csdn.net/qq_41950533/article/details/129181813