Deep Learning Inference and Training

Optimization and Generalization
The fundamental problem in deep learning is the tension between optimization and generalization.
Optimization refers to tuning the model to achieve the best possible performance on the training data (this is the "learning" in machine learning).
Generalization refers to how well the trained model performs on previously unseen data.
Classification of Datasets
Datasets can be divided into:
1. Training set: the data used by the actual training algorithm; it is used to compute the gradient and determine the update of the network weights in each iteration;
2. Validation set: a data set used to track the learning progress; it indicates how the function the network forms behaves between the training data points, and the error on the validation set is monitored throughout the training process;
3. Test set: the data set used to produce the final results.
In order for the test set to effectively reflect the generalization ability of the network:
1. The test set must never be used in any form to train the network, even for selecting a network from the same set of candidate networks. The test set can only be used after all training and model selection is complete;
2. The test set must represent all situations involved in network use.
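As a quick illustration, here is a minimal sketch in plain NumPy of splitting a data set into the three subsets (the 70/15/15 proportions are illustrative choices, not prescribed by the text):

```python
import numpy as np

def split_dataset(X, y, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle the data once, then carve out validation and test subsets.
    The remaining samples form the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]  # everything else is training data
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])
```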
Cross-validation
Suppose we have a data set and cut it into 3 parts (it can of course be divided into more folds):
The first part is used as the test set, the second and third parts as the training set, and the accuracy is calculated;
The second part is used as the test set, the first and third parts as the training set, and the accuracy is calculated;
The third part is used as the test set, the first and second parts as the training set, and the accuracy is calculated.
Finally, the average of the three accuracies is taken as the final accuracy.
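A minimal sketch of this 3-fold procedure in NumPy (the `train_fn` and `eval_fn` callables are placeholders standing in for whatever model is being trained and scored, not names from the text):

```python
import numpy as np

def cross_validate(X, y, train_fn, eval_fn, k=3, seed=0):
    """Split the data into k folds; each fold is the test set once,
    the remaining folds form the training set. Returns the mean accuracy."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    accuracies = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])
        accuracies.append(eval_fn(model, X[test_idx], y[test_idx]))
    return float(np.mean(accuracies))
```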


BP Neural Network
The BP network (Back-Propagation Network) was proposed in 1986. It is a multi-layer feed-forward network trained with the error backpropagation algorithm and is one of the most widely used neural network models, applied to function approximation, model identification and classification, data compression and dimensionality reduction, time series forecasting, and so on.
The BP network, also called the backpropagation neural network, is a supervised learning algorithm with strong self-adaptive, self-learning and nonlinear mapping abilities. It can handle problems with little data, poor information and uncertainty relatively well, and it is not subject to the restrictions of nonlinear models.
A typical BP network contains three kinds of layers: an input layer, hidden layers and an output layer. Adjacent layers are fully connected, and there are no connections within a layer.
There can be more than one hidden layer.
1. Feed the training data into the input layer of the neural network, pass it through the hidden layers, and finally reach the output layer and produce the result. This is the forward propagation process.
2. Because there is an error between the network's output and the actual result, compute the error between the estimated value and the actual value, and propagate this error backwards from the output layer towards the hidden layers, until it reaches the input layer;
3. During backpropagation, adjust the values of the parameters (the weights connecting the neurons) according to the error, so that the total loss function decreases.
4. Iterate the above three steps (i.e. train repeatedly on the data) until the stopping criterion is satisfied.
When we use neural networks to solve problems such as image segmentation and boundary detection, what exactly is the relationship between the input (say x) and the expected output (say y)? That is, in y = f(x), we do not know what f is, but we are very sure of one thing: f is not a simple linear function; it should be an abstract and complex relationship. The point of the neural network is to learn this relationship, store it in the model, and then use the obtained model to make inferences on data outside the training set and obtain the desired result.
Training (learning) process:
Forward propagation
The input signal propagates from the input layer through each hidden layer to the output layer, where the actual response value is obtained. If the error between the actual value and the expected value is large, training enters the error backpropagation stage.
Backpropagation
Following the gradient descent method, the connection weights and thresholds of each neuron are adjusted layer by layer, from the output layer back through each hidden layer. This is repeated iteratively until the error of the network output is reduced to an acceptable level, or the preset number of learning iterations is reached.
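A minimal sketch of one forward pass and one backpropagation update for a single-hidden-layer network, in plain NumPy with sigmoid activations and a squared-error loss (the layer sizes and learning rate are arbitrary illustrative choices, not values from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Illustrative sizes: 4 inputs, 8 hidden units, 2 outputs.
W1, b1 = rng.normal(0, 0.1, (4, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 0.1, (8, 2)), np.zeros(2)
lr = 0.1

def train_step(x, y):
    global W1, b1, W2, b2
    # Forward propagation: input -> hidden -> output.
    h = sigmoid(x @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Error between the network output and the expected value.
    delta_out = (out - y) * out * (1 - out)
    delta_h = (delta_out @ W2.T) * h * (1 - h)
    # Backpropagation: adjust the weights layer by layer, from output back to input.
    W2 -= lr * np.outer(h, delta_out); b2 -= lr * delta_out
    W1 -= lr * np.outer(x, delta_h);   b1 -= lr * delta_h
    return 0.5 * np.sum((out - y) ** 2)
```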

Epoch (generation): one complete training pass of the model over all the data in the training set is called one epoch ("one generation of training").
Batch size: a small subset of the training-set samples is used to perform one backpropagation parameter update of the model weights; this small subset is called "a batch of data", and its size is the batch size.

Iteration: the process of using one batch of data to update the parameters of the model is called one iteration (one training step). The result of each iteration is used as the starting point of the next iteration. One iteration = one forward pass + one backward pass.
For example, if the training set has 500 samples and batch size = 10, then training once over the entire sample set corresponds to iteration = 50 and epoch = 1.
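A sketch of how epochs, batches and iterations relate in a typical training loop (the `train_step` function is a placeholder for one forward + backward pass, e.g. the sketch above):

```python
import numpy as np

def run_training(X, y, train_step, epochs=1, batch_size=10, seed=0):
    """One epoch = one full pass over the data; each batch produces one iteration."""
    rng = np.random.default_rng(seed)
    n = len(X)
    iteration = 0
    for epoch in range(epochs):
        idx = rng.permutation(n)                 # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            train_step(X[batch], y[batch])       # one forward + backward pass
            iteration += 1
    # With 500 samples and batch_size=10: 50 iterations per epoch.
    return iteration
```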
Neural Network Training Process
1. Extract feature vectors as input.
2. Define the neural network structure, including the number of hidden layers, the activation functions, etc.
3. Through training, use the backpropagation algorithm to continually optimize the weight values until they reach the most reasonable level.
4. Use the trained neural network to make predictions on unknown data (inference). "Trained network" here means that the weights have reached their optimal state.
Neural Network Training Process
1. Select a sample (Ai, Bi) from the sample set, where Ai is the data and Bi is the label (category).
2. Feed it into the network and compute the network's actual output Y (at this stage the weights in the network are still random quantities).
3. Compute D = Bi − Y (that is, the difference between the predicted value and the actual value).
4. Adjust the weight matrix W according to the error D.
5. Repeat the above process for each sample until, over the entire sample set, the error does not exceed the specified range.
More specifically:
1. Randomly initialize the parameters
2. Forward propagation: compute the activation value of the output nodes for each sample
3. Compute the loss function
4. Backpropagation: compute the partial derivatives
5. Update the weights using gradient descent or a more advanced optimization method
Random initialization of parameters
All parameters must be given initial values, and these initial values must not all be set to the same number, for example all 0 or all 1.
If they are set to the same value, then all parameters remain equal after every update. That means all neurons compute the same function, resulting in a high degree of redundancy. So we must randomize the initial parameters.
In particular, if the neural network has no hidden layers, all parameters can be initialized to 0 (but then it can no longer be called a deep neural network).
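A small sketch of the symmetry problem: with identical initial weights, every hidden unit receives the same gradient and the units stay identical after the update (the tiny 2-3-1 network below is purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_W1(W1, w2, x, y):
    """Gradient of a tiny 2-3-1 sigmoid network (squared-error loss) w.r.t. W1."""
    h = sigmoid(x @ W1)                  # hidden activations
    out = h @ w2                         # scalar network output
    d_h = (out - y) * w2 * h * (1 - h)   # error signal reaching each hidden unit
    return np.outer(x, d_h)

x, y = np.array([1.0, 2.0]), 1.0

# Identical initialization: every column (hidden unit) gets the same gradient,
# so the units remain identical after every update.
print(grad_W1(np.full((2, 3), 0.5), np.full(3, 0.5), x, y))

# Random initialization breaks the symmetry: the columns differ.
rng = np.random.default_rng(0)
print(grad_W1(rng.normal(0, 0.5, (2, 3)), rng.normal(0, 0.5, 3), x, y))
```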
Normalization
Reason: when building and training a classifier or model, the range of the input data may be quite large, and the individual features of a sample may be on inconsistent scales. Such data easily distorts the training of the model or the construction of the classifier, so it needs to be standardized: the unit restrictions of the data are removed and the data is converted into dimensionless pure values, so that indicators of different units or magnitudes can be compared and weighted.
The most typical approach is normalization of the data, i.e. mapping the data uniformly to the interval [0,1]:
x' = (x − min) / (max − min), where min and max are the minimum and maximum values of the feature.
z-score normalization (zero-mean normalization):
x' = (x − μ) / σ
where μ is the sample mean and σ is the sample standard deviation; the processed data has mean 0 and standard deviation 1 (standard normal distribution).
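A small sketch of both normalization schemes in NumPy, applied per feature column (the data below is made up):

```python
import numpy as np

X = np.array([[1.0, 200.0], [2.0, 400.0], [4.0, 800.0]])  # toy data, 2 features

# Min-max normalization: map each feature to [0, 1].
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - x_min) / (x_max - x_min)

# z-score normalization: each feature ends up with mean 0 and std 1.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_zscore = (X - mu) / sigma

print(X_minmax)
print(X_zscore.mean(axis=0), X_zscore.std(axis=0))  # ~0 and ~1
```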
loss function
The loss function describes the gap between the model's predicted value and the real value. There are two common choices: mean squared error (MSE) and cross entropy.
Mean squared error (MSE):
MSE = (1/n) · Σᵢ (yᵢ − ŷᵢ)², where yᵢ is the true value and ŷᵢ is the predicted value.
Cross entropy is another loss function, generally used for classification problems. It expresses the probability that an input sample is predicted to belong to each class; the smaller the value, the more accurate the prediction. With y denoting the true classification (0 or 1) and a denoting the predicted value, the binary form is:
C = −(1/n) · Σ [ y·ln(a) + (1 − y)·ln(1 − a) ]
The choice of loss function depends on the type of the label data:
1. If the label is a real-valued, unbounded quantity, use MSE as the loss function.
2. If the label is a bit vector (classification flag), cross entropy is more suitable.
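As a quick sketch, both losses computed with NumPy (the array values are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error for real-valued targets."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, a, eps=1e-12):
    """Cross entropy for binary labels y in {0, 1} and predicted probabilities a."""
    a = np.clip(a, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y_true * np.log(a) + (1 - y_true) * np.log(1 - a))

y = np.array([1.0, 0.0, 1.0])
a = np.array([0.9, 0.2, 0.7])
print(mse(y, a))                   # regression-style loss
print(binary_cross_entropy(y, a))  # classification-style loss
```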
Gradient Descent
The gradient ∇f = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ) is the vector of partial derivatives of the function with respect to its variables. The direction of the gradient is the direction in which the value of the function increases, and the modulus of the gradient is the rate at which the function value increases.
Therefore, as long as the parameter values are repeatedly updated by a certain step in the direction opposite to the gradient, a minimum of the function (global minimum or local minimum) can be reached.
Generally, when the gradient is used to update the parameters, it is multiplied by a learning rate smaller than 1. This is because the modulus of the gradient is often rather large: using it directly to update the parameters would make the function value oscillate continuously and make it hard to converge to an equilibrium point (which is also why the learning rate should not be too large).

learning rate
The learning rate is an important hyperparameter that controls how quickly we adjust the neural network weights based on the loss gradient.
The smaller the learning rate, the more slowly we descend along the loss gradient.
In the long run this cautious, slow-moving choice may be fine, because it avoids skipping over local optima, but it also means that convergence takes more time, especially if we are near the peak of the curve.
new weight = current weight - learning rate × gradient
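A minimal sketch of this update rule on the simple one-dimensional function f(w) = (w − 3)², showing the effect of the learning rate (the function and values are illustrative):

```python
def gradient_descent(lr, steps=20, w=0.0):
    """Minimize f(w) = (w - 3)**2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2 * (w - 3)     # derivative of f at the current point
        w = w - lr * grad      # new weight = current weight - learning rate * gradient
    return w

print(gradient_descent(lr=0.1))  # converges towards the minimum at w = 3
print(gradient_descent(lr=1.1))  # learning rate too large: the value oscillates and diverges
```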

In the weight-update expression for a link w(jk):
Purple part: the difference between the correct result and the node's output, i.e. the error;
Red part: the node's activation function; every link feeding the node contributes its incoming signal multiplied by the link weight, these contributions are summed, and the activation function is applied to the sum;
Green part: the signal value output by the node at the front end of the link w(jk).
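Assuming a sigmoid activation and writing o_k for the node's output, t_k for its target value, and o_j for the front-end node's output, these three factors combine into the usual gradient of the error with respect to w(jk):

∂E/∂w_jk = −(t_k − o_k) · sigmoid(Σ_j o_j·w_jk) · (1 − sigmoid(Σ_j o_j·w_jk)) · o_j

where (t_k − o_k) is the purple error term, the two sigmoid factors arise from the red activation of the summed, weighted inputs, and o_j is the green output of the front-end node.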
Classification of generalization ability
Underfitting: the model does not have enough structure to represent the data well and the fit is poor; the model cannot reach a sufficiently low error on the training set;
Good fit: the gap between the test error and the training error is small;
Overfitting: the model fits the training samples too closely, but its prediction accuracy on the test samples is low, that is, the model generalizes poorly; the gap between the training error and the test error is too large;
Non-convergence: the model fails to learn even the training set.

overfitting
Overfitting means that, given a set of data containing noise, the model fitted to this data may fit the noise as well.
On the one hand, this makes the model more complicated;
on the other hand, the generalization performance of the model becomes poor: when the overfitted model encounters new data, its accuracy is very low.

Possible causes:
1. The modeling samples were selected badly: a wrong sampling method, wrong sample labels, or too few samples, so that the selected data is not enough to represent the intended classification rules;
2. The noise in the samples is too large, so that the machine treats part of the noise as features and the intended classification rules are disturbed;
3. The assumed model cannot reasonably exist, or the conditions for the assumption to hold are not met;
4. Too many parameters lead to an overly complex model;
5. For neural network models: a) the classification decision surface for the sample data may not be unique, and as learning progresses the BP algorithm may drive the weights towards an overly complex decision surface; b) the number of weight-learning iterations is large enough that the network fits the noise in the training data and the unrepresentative features of the training samples.
Solutions to overfitting:
1. Reduce features: delete features that are irrelevant to the target, for example with feature selection methods
2. Early stopping
        • At the end of each epoch, compute the accuracy on the validation data and stop training when the accuracy no longer improves.
        • A key question for this approach is how to decide that the validation accuracy is no longer improving. It does not mean stopping as soon as the validation accuracy drops once: the accuracy may fall after one epoch and rise again in later epochs, so one or two consecutive drops are not enough to conclude that it will not increase any more.
        • The usual practice is to record the best validation accuracy seen so far during training. When the best accuracy has not been improved for 10 consecutive epochs (or more), the accuracy can be considered to have stopped improving and the iteration can be stopped (Early Stopping).
        • This strategy is also called "No-improvement-in-n", where n is the number of epochs and can be chosen according to the actual situation, e.g. 10, 20 or 30.
3. More training samples.
4. Re-clean the data.
5. Dropout
        In a neural network, the dropout method is implemented by modifying the structure of the network itself:
        1. At the start of training, randomly delete some of the hidden-layer neurons (the fraction can be set to 1/2, 1/3, 1/4, etc.), i.e. treat these neurons as if they did not exist, while keeping the numbers of input-layer and output-layer neurons unchanged.
        2. Then learn and update the parameters of the ANN according to the BP learning algorithm (the units connected by dotted lines are not updated, because those neurons are considered temporarily deleted). This completes one iterative update. In the next iteration, another random subset of hidden neurons, different from the last one, is deleted, and this continues until the end of training.
        The dropout method thus prevents overfitting of the ANN by modifying the number of neurons in its hidden layers.
Why dropout can reduce overfitting:
        1. Dropout randomly selects hidden-layer nodes to ignore. In the training of each batch, the set of ignored hidden nodes is different each time, so the network is different each time, and each training pass can be regarded as training a "new" model;
        2. Hidden nodes appear randomly with a certain probability, so it cannot be guaranteed that any two hidden nodes always appear together. In this way the weight updates no longer depend on the joint action of hidden nodes with fixed relationships, which prevents situations in which some features are only effective in combination with other specific features.
        Summary: dropout is a very effective method of neural network model averaging. It averages the predicted probabilities by training a large number of different networks. The different models are trained on different training sets (the training data for each epoch is randomly selected), and in the end they share the same weights, which "fuses" them.
After cross-validation, the effect is best when the hidden-node dropout rate is 0.5.
Dropout can also be used as a way of adding noise by operating directly on the input. In that case the keep probability for the input layer is set to a number closer to 1 (for example 0.8), so that the input does not change too much.
A disadvantage of dropout is that training takes 2-3 times as long as training the same network without dropout.
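A minimal sketch of a dropout mask applied to hidden-layer activations during training (this uses the common "inverted dropout" scaling; the shapes and drop rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, drop_rate=0.5, training=True):
    """Randomly zero out hidden activations during training.
    Dividing by the keep probability keeps the expected activation unchanged
    (the 'inverted dropout' convention), so inference needs no extra scaling."""
    if not training or drop_rate == 0.0:
        return h                           # at inference time, use all neurons
    keep = 1.0 - drop_rate
    mask = rng.random(h.shape) < keep      # which hidden units survive this iteration
    return h * mask / keep

h = np.ones((2, 6))                        # pretend hidden-layer activations
print(dropout(h, drop_rate=0.5))           # a different random subset is dropped each call
```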

Origin blog.csdn.net/cyy1104/article/details/131909292