Loss function and regularization

More than 90% of this article comes from Stanford's neural network course notes; it records what I learned for future review.

Table of contents:

  • Loss function
  • Regularization
  • Application of regularization in the loss function
  • Softmax and SVM
  • Cross entropy
  • Maximum likelihood estimation (MLE)
  • Summary

1. Loss function

This article uses an example to explain what a loss function is:

[Figure: class scores for an example cat image under a particular weight matrix W; the cat class receives a very low score]

For the specific meaning of the parameters in this figure, please refer to my neural network learning and summary article; I will not repeat it here.
From the figure we can see that this particular set of weights W is not very effective: it gives the cat a very low score. We use the loss function (sometimes called the cost function or objective function) to measure our dissatisfaction with the results. Intuitively, if we classify the training data poorly the loss will be large; if we do well the loss will be small.
There are several ways to define the loss function:
MSE (Mean Squared Error):

MSE = (1/N) ∑_i (y_i − ŷ_i)²

MSE is more commonly used in linear regression problems.
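A minimal NumPy sketch of this formula (y_true and y_pred are hypothetical arrays):

import numpy as np

y_true = np.array([1.0, 2.0, 3.0])        # hypothetical targets
y_pred = np.array([1.1, 1.9, 3.4])        # hypothetical predictions
mse = np.mean((y_true - y_pred) ** 2)     # mean of squared errors
print(mse)                                # 0.06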
Multiclass Support Vector Machine (SVM) loss:
This loss encourages the score of the correct class for each image to be higher than the scores of the incorrect classes by at least a margin Δ. For the i-th sample, the loss is defined as:

L_i = ∑_{j ≠ y_i} max(0, s_j − s_{y_i} + Δ)

where s_j is the score of class j and y_i is the index of the correct class.
Let us work through an example to see how this function behaves. Suppose we have three categories with scores s = [13, -7, 11], and the first category is the correct one. Also assume the margin hyperparameter Δ defined above is 10. Substituting into the loss function, we get two terms:

L_i = max(0, -7 − 13 + 10) + max(0, 11 − 13 + 10)

The first term evaluates to 0 and the second to 8. Since the correct class score (13) exceeds the first wrong class score (-7) by at least 10, that pair contributes zero loss; the actual difference is 20, far more than 10, but the SVM only cares that the gap is at least 10, and any extra margin is clamped to zero by the max operation. The second term, max(0, 11 − 13 + 10), gives 8: even though the correct class scores higher than this wrong class (13 > 11), it is not higher by the desired margin of 10. The gap is only 2, which is why the loss is 8 (the amount still missing to reach the margin). In short, the SVM loss wants the correct class score to exceed every incorrect class score by at least Δ; whenever this is not the case, we accumulate loss.
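A minimal NumPy sketch of this worked example (svm_loss_i is a hypothetical helper name, not from the course code):

import numpy as np

def svm_loss_i(scores, correct_class, delta):
    # hinge loss summed over all incorrect classes
    margins = np.maximum(0, scores - scores[correct_class] + delta)
    margins[correct_class] = 0            # the correct class contributes no loss
    return np.sum(margins)

s = np.array([13.0, -7.0, 11.0])
print(svm_loss_i(s, correct_class=0, delta=10))   # 8.0, as computed above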
For neural networks, we use a linear mapping to obtain the scores:

s = f(x_i; W) = W x_i

So we can rewrite the loss function in terms of W:

L_i = ∑_{j ≠ y_i} max(0, w_jᵀ x_i − w_{y_i}ᵀ x_i + Δ)

where w_j is the j-th row of the matrix W (the row corresponding to the j-th class), written as a column vector.

[Figure: illustration of the SVM margin — the correct class score must exceed every other class score by at least Δ before the loss reaches zero]

Finally, the per-sample losses are averaged into a single overall loss over all training samples:

L = (1/N) ∑_i L_i
Softmax: this will be introduced below.

2. Regularization

The following are several common regularization methods:
(1) L2 regularization:
L2 regularization is the most common form of regularization. It penalizes the squared magnitude of every parameter, that is, it adds a term ½λw² to the loss for every weight w; the factor ½ is there so that the gradient of this term with respect to each w is λw rather than 2λw. L2 regularization has an appealing property: it encourages the network to use all of its inputs a little rather than a few inputs a lot. Finally, note that during the gradient descent parameter update, the L2 penalty means every weight decays linearly toward zero: W += -lambda * W. (A combined sketch of all four regularizers appears after this list.)
(2) L1 regularization:
L1 regularization adds the term λ|w| for every weight w, and L1 and L2 can sometimes be combined. L1 has a very appealing property: during optimization it drives the weight vectors to become sparse. In other words, with L1 the network ends up using only a sparse subset of its most important inputs and becomes largely insensitive to the noisy ones. In contrast, the final weight vectors from L2 regularization are usually diffuse, small numbers. In practice, if you are not concerned with explicit feature selection, L2 regularization usually performs better than L1.
(3) Max norm constraints:
This method imposes an absolute upper bound on the norm of each neuron's weight vector and enforces the constraint during gradient descent. In practice, we update the parameters w as usual and then clamp w back into the allowed range. One appealing property is that the network cannot "explode" even when the learning rate is set too high, because the updates are always bounded.
(4) Dropout:
Dropout is a simple method to prevent a neural network from overfitting, and it complements the other methods (L1, L2, max norm). During training, each neuron is kept active with a certain probability p (a hyperparameter) and set to 0 otherwise. The implementation details are somewhat involved; see Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
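A combined NumPy sketch of the four methods above, with a hypothetical weight matrix W, data-loss gradient dW, and illustrative hyperparameters lam, max_norm and p:

import numpy as np

np.random.seed(0)
W = np.random.randn(10, 5)       # hypothetical weight matrix
dW = np.random.randn(10, 5)      # hypothetical gradient of the data loss w.r.t. W
lam, lr = 1e-3, 1e-2             # regularization strength, learning rate

# (1) L2: add 0.5 * lam * W**2 to the loss; its gradient is lam * W,
#     which in the update behaves like linear weight decay: W += -lam * W
dW_l2 = dW + lam * W

# (2) L1: add lam * |W| to the loss; its (sub)gradient is lam * sign(W)
dW_l1 = dW + lam * np.sign(W)

W -= lr * dW_l2                  # ordinary gradient step (here with the L2 term)

# (3) Max norm: after the update, clamp each neuron's weight vector
#     (here: each row of W) so that its L2 norm never exceeds max_norm
max_norm = 3.0
row_norms = np.linalg.norm(W, axis=1, keepdims=True)
W *= np.minimum(1.0, max_norm / (row_norms + 1e-12))

# (4) Inverted dropout on hidden activations h: keep each unit with
#     probability p and rescale by 1/p, so nothing changes at test time
p = 0.5
h = np.maximum(0, np.random.randn(4, 10))     # hypothetical hidden layer outputs
mask = (np.random.rand(*h.shape) < p) / p
h_dropped = h * mask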

3. Application of regularization in the loss function

Suppose we have found a weight matrix W for which the loss of every sample is 0. A problem then arises: this weight matrix is not strictly unique, since many different W can classify all the samples correctly.
The first thing to clarify is that if some parameters W classify all the examples correctly, then so does any multiple λW of these parameters (for λ > 1), because this transformation stretches all the scores uniformly. For example, if the score difference between a correct class and the closest incorrect class is 15, then multiplying all elements of W by 2 makes the new difference 30.
How do we express a preference for one particular weight matrix W? We add a regularization penalty R(W) to the loss function; the most common choice is the L2 penalty:

R(W) = ∑_k ∑_l W_{k,l}²

The complete loss function then becomes:

L = (1/N) ∑_i L_i + λ R(W)

Written out in full:

L = (1/N) ∑_i ∑_{j ≠ y_i} max(0, f(x_i; W)_j − f(x_i; W)_{y_i} + Δ) + λ ∑_k ∑_l W_{k,l}²

where N is the number of training samples and λ is a hyperparameter. The regularization penalty has many desirable properties. The most appealing one is that penalizing large weights tends to improve generalization, because no single input dimension can then have a very large influence on the scores by itself.
Suppose we have an input vector x = [1,1,1,1] and two weight vectors w1 = [1,0,0,0] and w2 = [0.25,0.25,0.25,0.25]. Then w1ᵀx = w2ᵀx = 1, so the two weight vectors produce the same dot product, but the L2 penalty of w1 is 1.0 while the L2 penalty of w2 is 0.25. According to the L2 penalty, the weight vector w2 is therefore preferred, because it achieves a lower regularization loss. Intuitively, this is because the weights in w2 are smaller and more spread out. Since the L2 penalty prefers smaller, more diffuse weight vectors, the final classifier is encouraged to take all input dimensions into account rather than relying on only a few of them. This plays an important role in improving the generalization of the classifier and preventing overfitting.
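A quick NumPy check of this example:

import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])

print(w1 @ x, w2 @ x)                   # 1.0 1.0  -> identical scores
print(np.sum(w1**2), np.sum(w2**2))     # 1.0 0.25 -> the L2 penalty prefers w2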

4. Softmax and SVM

Softmax: If you have heard of the binary logistic regression classifier, the Softmax classifier is its generalization to multiple classes. Unlike the SVM, which treats the output f(x_i, W) as scores for each class (uncalibrated and possibly hard to interpret), the Softmax classifier gives a more intuitive output: normalized class probabilities. First look at the form of the loss function:

L_i = −log( e^{f_{y_i}} / ∑_j e^{f_j} )

The goal of the Softmax classifier is to minimize the cross entropy between the estimated class probabilities and the true distribution; the cross entropy wants the predicted distribution to put all of its mass on the correct answer. First observe the following expression:

P(y_i | x_i; W) = e^{f_{y_i}} / ∑_j e^{f_j}

It can be interpreted as the (normalized) probability assigned to the correct label y_i given the image x_i and parameterized by W. To see this, recall that the Softmax classifier interprets the scores in the output vector f as unnormalized log probabilities. Exponentiating these quantities therefore gives (unnormalized) probabilities, and the division normalizes them so that the probabilities sum to 1.
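A minimal sketch of this loss for a single example, assuming a score vector f and a correct class index y:

import numpy as np

def softmax_loss_i(f, y):
    f = f - np.max(f)                   # shift for numerical stability (see below)
    p = np.exp(f) / np.sum(np.exp(f))   # normalized class probabilities
    return -np.log(p[y])                # cross entropy with the one-hot true label

f = np.array([13.0, -7.0, 11.0])
print(softmax_loss_i(f, y=0))           # ~0.13: small loss, the correct class dominates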
From a probabilistic perspective, we are minimizing the negative log-likelihood of the correct class, which can be interpreted as performing maximum likelihood estimation (MLE); see the maximum likelihood estimation section below for details. A nice feature of this view is that the regularization term R(W) in the complete loss function can be interpreted as coming from a Gaussian prior over the weight matrix W, in which case we are performing maximum a posteriori (MAP) estimation instead. The reason for choosing the log function here is cross entropy; readers who want to dig deeper can refer to "the role of cross entropy in machine learning". This article briefly introduces it in the next section.

If you implement the Softmax function, you need to be careful when writing the code: because of the exponentials, both the numerator and the denominator can become very large numbers, and arithmetic with very large numbers is numerically unstable. A constant C is introduced to solve this problem, transforming the expression as follows:

e^{f_{y_i}} / ∑_j e^{f_j} = C e^{f_{y_i}} / (C ∑_j e^{f_j}) = e^{f_{y_i} + log C} / ∑_j e^{f_j + log C}

We are free to choose the value of C; it does not change the result, but it can be used to improve the numerical stability of the computation. The most common choice is to set log C = −max_j f_j, i.e. to shift the scores so that the highest value is zero, which is easy to understand from the code:

import numpy as np

f = np.array([123, 456, 789])
p = np.exp(f) / np.sum(np.exp(f)) # numerically unstable: the exponentials overflow

# shift the scores so that the maximum value becomes 0
f -= np.max(f) # f -> [-666, -333, 0]
p = np.exp(f) / np.sum(np.exp(f)) # same result, computed safely

The following is a more intuitive understanding of the difference between the two classifiers:
[Figure: the same scores for one data point interpreted by the SVM (hinge loss over score margins) and by the Softmax classifier (cross entropy over normalized probabilities)]
SVM:
The SVM interprets the outputs as class scores, and its loss function encourages the correct class (class 2, shown in blue) to have a higher score than the other classes.
Softmax classifier:
The Softmax classifier interprets the scores as (unnormalized) log probabilities for each class and then encourages the (normalized) log probability of the correct class to be high (equivalently, its negative to be low).
Unlike the SVM, whose class scores are uncalibrated and not easy to interpret, the Softmax classifier lets us compute "probabilities" for all labels. Probabilities is in quotes because how peaked or diffuse they are also depends on the regularization strength λ in the loss function: if λ is high, the penalty on the weights W increases, the weights become smaller, and the resulting probability distribution becomes more diffuse.
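A small sketch of this effect: halving the weights (which a stronger L2 penalty tends to do) halves the scores and makes the softmax output more diffuse. The score values below are illustrative only.

import numpy as np

def softmax(f):
    f = f - np.max(f)
    return np.exp(f) / np.sum(np.exp(f))

scores = np.array([1.0, -2.0, 0.0])    # hypothetical scores W x
print(softmax(scores))                 # ~[0.71, 0.04, 0.26] -> fairly peaked
print(softmax(0.5 * scores))           # ~[0.55, 0.12, 0.33] -> more diffuse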

5. Cross entropy

(Excerpted from "the role of cross entropy in machine learning".)
A brief summary: neural network classification problems fall into single-label classification problems and multi-label classification problems. For example:
Single-label classification problem: continuing the example from the beginning of the article, suppose the target classes are cat, dog and ship. Suppose we now have a cat sample, so the corresponding label is [0 (dog), 1 (cat), 0 (ship)], and suppose the probabilities P obtained by the computation above are [0.3, 0.6, 0.1]. The cross-entropy formula is:
H(y, P) = − ∑_k y_k log(P_k)

According to the formula, the loss (cross entropy) for this sample is:

loss = −(0 · log 0.3 + 1 · log 0.6 + 0 · log 0.1) = −log 0.6
Multi-label classification problem:
Continuing with the same example, suppose the target classes are again cat, dog and ship. Unlike the single-label case, the label of a multi-label sample is n-hot: for example, a picture can contain both a dog and a cat, giving the label [1 (dog), 1 (cat), 0 (ship)], which is what makes it a multi-label problem.
Here Pred is no longer computed with softmax; a sigmoid is used instead, normalizing the output of each node to [0, 1] independently, so the Pred values no longer sum to 1. In other words, each label follows its own independent distribution and the labels do not affect each other. The cross entropy is therefore computed separately for each node, and since each node has only two possible values it follows a binomial (Bernoulli) distribution. As mentioned earlier, for such a special distribution the entropy computation can be simplified.
Similarly, the cross-entropy computation can also be simplified: for each node,

loss_j = −y_j log(P_j) − (1 − y_j) log(1 − P_j)

and the loss of a single sample is loss = loss_cat + loss_dog + loss_ship.
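A minimal NumPy sketch of both cases, using the single-label numbers above; the multi-label probabilities are illustrative only:

import numpy as np

# single-label: softmax output, one-hot target
y_single = np.array([0.0, 1.0, 0.0])    # (dog, cat, ship) -> a cat sample
p_single = np.array([0.3, 0.6, 0.1])    # softmax probabilities, sum to 1
loss_single = -np.sum(y_single * np.log(p_single))   # = -log(0.6) ≈ 0.51

# multi-label: independent sigmoid outputs, n-hot target, one Bernoulli term per node
y_multi = np.array([1.0, 1.0, 0.0])     # a picture containing both a dog and a cat
p_multi = np.array([0.8, 0.6, 0.2])     # illustrative sigmoid outputs, need not sum to 1
loss_multi = -np.sum(y_multi * np.log(p_multi) + (1 - y_multi) * np.log(1 - p_multi))
# = loss_dog + loss_cat + loss_ship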

6. Maximum likelihood estimation

(Reference article)
Note: some of the referenced author's content is copied here directly, in order to compare it with this article and give the reader a clearer understanding.

In practical problems, we may only be able to obtain a limited amount of sample data, and both the prior probabilities and the class-conditional probabilities (the overall distribution of each class) are unknown. When classifying based only on this sample data, a feasible approach is to first estimate the prior probabilities and the class-conditional probabilities.

Estimating the prior probabilities is relatively simple, because the class of each sample is known to the network (supervised learning). Estimating the class-conditional probabilities is much harder, for reasons including: the probability density function contains all the information about a random variable while the available sample data may be limited, and the dimension of the feature vector x may be very large.

Maximum likelihood estimation is a parameter estimation method. Of course, the choice of the probability density function (the model) is very important: if the model is correct, then as the number of samples tends to infinity we obtain an increasingly accurate estimate. The goal of maximum likelihood estimation is to use the known sample results to infer the parameter values that are most likely (with maximum probability) to have produced those results.

In a neural network, the target parameter of the maximum likelihood estimate is the weight matrix W, and the class-conditional probability (corresponding to the cross entropy) is:

P(y_i | x_i; W) = e^{f_{y_i}} / ∑_j e^{f_j}

The steps for solving the maximum likelihood function are:

  1. ML estimation: find the value of θ that maximizes the probability of the observed sample set D = {x_1, …, x_N}:

     L(θ) = p(D | θ) = ∏_{k=1}^{N} p(x_k | θ),    θ̂ = argmax_θ L(θ)

  2. In practice, to simplify the analysis, the log-likelihood function is used instead (this is where Softmax's cross-entropy loss comes from):

     H(θ) = ln L(θ) = ∑_{k=1}^{N} ln p(x_k | θ)

     For the Softmax classifier, maximizing the log-likelihood of the correct class is equivalent to minimizing L_i = −f_{y_i} + log ∑_j e^{f_j}, which is exactly the cross-entropy loss given above (a numeric check of this appears after these steps).
  3. When there are multiple unknown parameters (θ is a vector), θ can be expressed as an unknown vector with S components:

     θ = [θ_1, θ_2, …, θ_S]ᵀ

     Recall the gradient operator:

     ∇_θ = [∂/∂θ_1, ∂/∂θ_2, …, ∂/∂θ_S]ᵀ

If the likelihood function is continuously differentiable, the maximum likelihood estimate is a solution of the equation:

∇_θ ln L(θ) = 0

The solution of this equation is only an estimate; it approaches the true value only as the number of samples tends to infinity.
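As a quick numeric check of the connection between the log-likelihood and the Softmax loss (a sketch reusing the scores from the earlier SVM example):

import numpy as np

f = np.array([13.0, -7.0, 11.0])        # scores for one sample
y = 0                                   # index of the correct class

shifted = f - np.max(f)
p = np.exp(shifted) / np.sum(np.exp(shifted))
nll = -np.log(p[y])                                          # -log softmax(f)[y]
li = -f[y] + (np.max(f) + np.log(np.sum(np.exp(shifted))))   # -f_y + log sum_j e^{f_j}
print(np.isclose(nll, li))              # True: minimizing L_i maximizes the log-likelihood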

For neural networks, taking the gradient of the (negative) log-likelihood and driving it toward zero is exactly gradient-descent optimization of the loss function. This article does not discuss it in detail; you can refer to my neural network learning and summary. Finally, the neural network uses the computed posterior probabilities of the samples for prediction and classification.

7. Summary

This article mainly used the two most common classifiers, SVM and Softmax, to explain the loss function in neural networks, from which we saw that the loss function is defined so that good predictions on the training data correspond to a small loss.
The performance difference between the SVM and Softmax is usually very small, and different people will have different opinions on which classifier works better. The Softmax classifier is never completely satisfied with the scores it produces: it always wants the probability of the correct class to be higher and those of the incorrect classes to be lower, so the loss can always be pushed down further. The SVM, however, is satisfied once the margins are met and does not micro-manage the exact scores beyond that limit.
In addition, this article introduced some commonly used regularization formulas and the specific steps of maximum likelihood estimation. After the loss function comes optimization; optimization in neural networks is usually implemented with gradient descent. For details, see my understanding and learning about the gradient descent method.

Origin blog.csdn.net/qq_41174940/article/details/104189504