Label Smoothing in Detail

1. What is label smoothing?

Label smoothing, like L1/L2 regularization and dropout, is a regularization technique in machine learning. It is usually used in classification problems to prevent the model from predicting labels with too much confidence during training, thereby alleviating poor generalization.

Label smoothing converts hard labels into soft labels, which makes network optimization smoother. It is an effective regularization tool for deep neural networks (DNNs): it generates soft labels by taking a weighted average of a uniform distribution and the hard labels. It is commonly used to reduce overfitting when training DNNs and to further improve classification performance.

Several equivalent terms appear in the literature:

  • hard target and soft target

  • hard label and soft label

2. Why do you need label smoothing?

For classification problems, we usually assume that in the training data the label vector assigns probability 1 to the target class and probability 0 to every non-target class. The traditional one-hot encoded label vector $y_i$ is

$$y_i = \begin{cases} 1, & i = \text{target class} \\ 0, & i \neq \text{target class} \end{cases}$$

When training the network, we minimize the cross-entropy loss over the $K$ classes,

$$L = -\sum_{i=1}^{K} y_i \log p_i ,$$

where $p_i$ is obtained by applying the Softmax function to the logits vector $z$ output by the penultimate layer of the model:

$$p_i = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)} .$$

When learning from traditional one-hot encoded labels, the network is encouraged to push the predicted probability of the target class toward 1 and the probabilities of the non-target classes toward 0. In other words, the target-class logit $z_i$ (whose softmax output is the predicted probability of the target class) is driven toward infinity, so the model keeps learning to make the logit gap between the correct label and the wrong labels larger and larger. An excessively large logit gap makes the model inflexible and overconfident in its predictions. When the training data cannot cover all cases, this leads to overfitting and poor generalization; moreover, some of the labels may themselves be inaccurate. In such cases, using the plain cross-entropy loss as the objective may not be optimal.
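As a minimal NumPy sketch (not from the original post; the numbers are made up), the following computes the softmax and cross-entropy defined above and shows how a widening logit gap drives the target probability toward 1 and the loss toward 0, which is exactly the overconfidence described here:

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y_onehot, logits):
    # L = -sum_i y_i * log(p_i); with a one-hot y, only the target
    # dimension contributes to the loss.
    p = softmax(logits)
    return -np.sum(y_onehot * np.log(p + 1e-12))

y = np.array([1.0, 0.0, 0.0])  # 3-class example, target is class 0

# As the gap between the target logit and the rest grows, the predicted
# probability saturates at 1 and the loss approaches 0, yet the gradient
# keeps encouraging an ever larger gap.
for gap in (1.0, 5.0, 10.0, 20.0):
    z = np.array([gap, 0.0, 0.0])
    print(gap, round(softmax(z)[0], 6), round(cross_entropy(y, z), 6))
```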

1. The mathematical definition of label smoothing

Label smoothing mixes in the uniform distribution, replacing the traditional one-hot encoded label vector $y_i$ with an updated, smoothed label vector $\hat{y}_i$:

$$\hat{y}_i = (1 - \alpha)\, y_i + \frac{\alpha}{K}$$

where $K$ is the total number of classes and $\alpha$ is a small hyperparameter (typically 0.1), that is

$$\hat{y}_i = \begin{cases} 1 - \alpha + \dfrac{\alpha}{K}, & i = \text{target class} \\[4pt] \dfrac{\alpha}{K}, & i \neq \text{target class} \end{cases}$$
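A minimal NumPy sketch (assuming α = 0.1 and six classes; not from the original post) of the smoothing formula above:

```python
import numpy as np

def smooth_labels(y_onehot, alpha=0.1):
    # y_hat = (1 - alpha) * y + alpha / K, mixing the one-hot label
    # with a uniform distribution over the K classes.
    K = y_onehot.shape[-1]
    return (1.0 - alpha) * y_onehot + alpha / K

y = np.array([0., 0., 1., 0., 0., 0.])   # target is class 2
print(smooth_labels(y, alpha=0.1))
# -> approximately [0.017 0.017 0.917 0.017 0.017 0.017]
```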

In this way, the smoothed label distribution is equivalent to adding noise to the true distribution. It prevents the model from becoming too confident about the correct label, so the gap between the outputs for the predicted positive and negative classes is not as large, which helps avoid overfitting and improves the model's generalization ability.

The NeurIPS 2019 paper "When Does Label Smoothing Help?" uses experiments to show why label smoothing works: it makes the cluster of each class more compact, increases inter-class distance, reduces intra-class distance, and improves generalization. It also improves model calibration, i.e., how well the model's confidence aligns with its accuracy. However, using label smoothing on a model that will serve as a teacher in distillation can degrade the student's performance.

2. A concrete example

For example, consider a six-class classification task. How does the CE loss compute the loss of a predicted probability vector $p$ relative to the label $y$?

$$\mathrm{CE}(y, p) = -\sum_{i=1}^{6} y_i \log p_i = -\log p_t, \quad t = \text{target class}$$

From the CE-loss formula it is clear that only the dimension where $y$ equals 1 participates in the loss; all other dimensions are ignored (a short numerical sketch follows the list below). This has several consequences:

  1. The relationship between the true label and the other labels is ignored, so a lot of useful knowledge cannot be learned; for example, "bird" and "airplane" are relatively similar, so if the model predicts these two classes as close, it arguably deserves a smaller loss;
  2. It tends to make the model more "absolute", turning it into a "black-and-white" model with poor generalization;
  3. It is more vulnerable to confusable classification tasks and to noisy (mislabeled) datasets.
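Below is a small NumPy sketch of the point above (the probability vectors are hypothetical): with a one-hot target the loss keeps rewarding ever more extreme predictions, whereas with a smoothed target a moderately confident prediction actually achieves the lower loss:

```python
import numpy as np

def cross_entropy(y, p):
    # CE(y, p) = -sum_i y_i * log(p_i)
    return -np.sum(y * np.log(p + 1e-12))

K, alpha = 6, 0.1
y_hard = np.array([1., 0., 0., 0., 0., 0.])
y_soft = (1 - alpha) * y_hard + alpha / K          # [0.917, 0.017, ...]

p_extreme  = np.array([0.995] + [0.001] * 5)       # near "black and white"
p_moderate = np.array([0.9167] + [0.0167] * 5)     # roughly matches y_soft

# Hard target: the extreme prediction gets the lower loss.
print(cross_entropy(y_hard, p_extreme), cross_entropy(y_hard, p_moderate))
# Smoothed target: the moderate prediction gets the lower loss instead.
print(cross_entropy(y_soft, p_extreme), cross_entropy(y_soft, p_moderate))
```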

In short, these issues stem from the one-hot representation itself: one-hot encoding is only a simplification of the real situation.

To address the overfitting that one-hot labels can easily cause, researchers proposed the label smoothing method:

Label smoothing adds a small amount of noise to each dimension of the original one-hot representation. It is a simple, even crude, but very effective method that has been used in many image classification models.
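In practice, many frameworks expose this directly. For instance, PyTorch's built-in cross-entropy loss accepts a label_smoothing argument (available since PyTorch 1.10); a minimal sketch:

```python
import torch
import torch.nn as nn

# Cross-entropy with label smoothing (alpha = 0.1).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(8, 6)             # batch of 8 samples, six classes (dummy data)
targets = torch.randint(0, 6, (8,))    # integer class indices
loss = criterion(logits, targets)
print(loss.item())
```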

3. Advantages and disadvantages of one-hot and Label Smoothing

1. One-hot disadvantages:

  • It may lead to overfitting. Labeling with hard 0s and 1s pushes the model's estimated probabilities to 1 or close to 1. This encoding is not soft enough and easily overfits: the training set is usually limited and rarely covers all situations, especially when there are relatively few training samples.
  • It causes the model to be overconfident in its predictions, so the model's predictions for the observed variable x can deviate severely from the real situation.

2. Advantages of Label Smoothing:

  • To some extent it alleviates the model's tendency to be over-absolute, and it provides a degree of robustness to label noise;
  • It compensates for the weak supervision signal (low information entropy) of plain one-hot classification and increases the amount of information in the targets;
  • It injects some relationship between categories into the training targets (a form of data augmentation);
  • It may improve the model's generalization ability;
  • It reduces the influence of the feature norm, so that samples of each class cluster together more tightly;
  • It yields a better-calibrated network, which generalizes better and ultimately makes more accurate predictions on unseen production data.

3. Disadvantages of Label Smoothing:

  • Simply mixing in uniform noise cannot reflect the relationships between labels, so the improvement to the model is limited, and there is even a risk of underfitting.
  • It is not helpful for building networks that will later serve as teachers; training with hard targets produces a better teacher network.

4. In which scenarios does label smoothing work well?

1. Applicable scenarios

Here are some thoughts on usage scenarios from the NLP field:

  1. In real scenarios, especially when the amount of data is large, the data will contain noise (of course, if you are 100% sure the labels are completely correct, this does not apply). To keep the model from learning this noise, label smoothing can be added.
  2. To keep the model from being too confident. Sometimes a trained model gives very high confidence, but we do not always want that (it may lead to problems such as overfitting) and would rather increase the difficulty of the learning task; label smoothing is introduced for this purpose as well.
  3. There are ambiguous cases in classification; for example, in image classification some pictures look like both a cat and a dog, and a soft target can provide a supervision signal for both categories.

Hinton's paper "When Does Label Smoothing Help?" explains the role of label smoothing from another angle: it may help more when there are many categories, pulling samples of the same class closer together while keeping different classes separated; with only a few categories the effect may be weaker.

Label smoothing has a parameter, epsilon (the α above), which controls the degree of softening. The larger its value, the smaller the target-class probability in the smoothed label vector and the smoother the label; conversely, the smaller the value, the closer the label is to a hard label.
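A quick numerical sketch (assuming K = 6 classes; not from the original post) of how epsilon controls the softness:

```python
import numpy as np

K = 6
y = np.array([1., 0., 0., 0., 0., 0.])
for eps in (0.0, 0.1, 0.3):
    y_ls = (1 - eps) * y + eps / K
    print(eps, np.round(y_ls, 3))
# eps = 0.0 keeps the hard label; larger eps lowers the target-class
# probability (about 0.917, then 0.75) and makes the label smoother.
```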

Using label smoothing with larger models can effectively improve accuracy, while using it with smaller models may reduce accuracy.

2. Inapplicable scenarios

Label smoothing benefits the generalization of the teacher network, but it causes the teacher to transfer less information to the student network.
Although training with label smoothing improves the teacher's final accuracy, compared with a teacher trained on "hard" targets it fails to transfer enough knowledge to the student network (which is trained without label smoothing). Label smoothing "erases" some of the finer-grained details that are preserved during hard-target training.

Why a model trained with label smoothing makes a poor teacher can be seen, more or less, from the visualizations mentioned earlier. By forcing the representations of each class into tighter clusters, the network discards details and focuses on the core distinctions between classes. This "rounding off" helps the network handle unseen data better, but the discarded information ends up hurting its ability to teach new student models. In other words, a more accurate teacher is not necessarily better at distilling knowledge into students.
