In-Depth Understanding of Deep Learning - Regularization: Early Stopping

Category: General catalog of the "In-Depth Understanding of Deep Learning" series


When training a large model with enough representational capacity to overfit, we often observe that the training error decreases steadily over time while the validation error begins to rise again. The figure below shows an example of this behavior, which occurs almost without fail.

[Figure: training error gradually decreases over time while the validation error rises again]

This means we can obtain a model with lower validation error (and thus, we hope, lower test error) simply by returning to the parameter setting that achieved the lowest validation error. Each time the validation error improves, we store a copy of the model parameters. When the training algorithm terminates, we return these parameters rather than the latest ones. The algorithm terminates when the validation error has not improved for a prespecified number of iterations. This strategy is called early stopping. It is probably the most commonly used form of regularization in deep learning, and its popularity stems from both its effectiveness and its simplicity.
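To make the procedure concrete, here is a minimal runnable sketch of early stopping with a patience counter, using plain gradient descent on a toy linear-regression problem. The data, model, and hyperparameter values (`lr`, `patience`) are illustrative assumptions, not part of the original text.

```python
import numpy as np

# Toy regression problem: 150 training examples, 50 held out for validation.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

w = np.zeros(10)
lr = 0.01        # learning rate (illustrative)
patience = 20    # stop after this many steps without validation improvement
best_val, best_w, best_step = np.inf, w.copy(), 0
steps_since_improvement = 0

for step in range(1, 10_001):
    grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= lr * grad

    val_err = mse(w, X_val, y_val)
    if val_err < best_val:
        # Validation error improved: store a copy of the parameters.
        best_val, best_w, best_step = val_err, w.copy(), step
        steps_since_improvement = 0
    else:
        steps_since_improvement += 1
        if steps_since_improvement >= patience:
            break  # validation error has stopped improving

# Return the best parameters, not the most recent ones.
print(f"stopped at step {step}, best step {best_step}, val MSE {best_val:.4f}")
```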

We can think of early stopping as a very efficient hyperparameter selection algorithm. In this view, the number of training steps is just another hyperparameter. As the figure above shows, this hyperparameter has a U-shaped performance curve on the validation set, as do most hyperparameters that control model capacity. In the case of early stopping, we control the effective capacity of the model by controlling the number of steps taken to fit the training set. Most hyperparameters must be selected by a costly guess-and-check process: we pick a value before training begins and only learn its effect after running the training procedure. Training time is unique in that a single training run tries out many values of the hyperparameter. The only significant cost of selecting this hyperparameter automatically via early stopping is the periodic evaluation on the validation set during training. Ideally, this evaluation runs in parallel with the main training process on a separate machine, a separate CPU, or a separate GPU. Without such resources, the cost can be reduced by using a validation set that is small relative to the training set, or by evaluating on it less frequently and accepting a coarser estimate of the optimal training time.
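As a hedged illustration of this view: in a single run we can record the validation error after every step and read off the best "training time" as the argmin of the recorded curve. The toy setup and all values below are illustrative assumptions.

```python
import numpy as np

# One training run evaluates every value of the "number of training steps"
# hyperparameter at once: record the validation error after each step and
# take the argmin of the recorded curve afterwards.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

w, lr = np.zeros(10), 0.01
val_curve = []
for step in range(2000):
    grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= lr * grad
    val_curve.append(float(np.mean((X_val @ w - y_val) ** 2)))

tau_star = int(np.argmin(val_curve)) + 1  # best "training time" from one run
print(f"optimal number of steps: {tau_star}, val MSE: {min(val_curve):.4f}")
```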

An additional cost of early stopping is the need to maintain a copy of the best parameters. This cost is generally negligible, because the copy can be stored in slower, larger memory (for example, training in GPU memory while storing the best parameters in host memory or on disk). Since the best parameters are written infrequently and never read during training, these occasional slow writes have little effect on total training time.
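For instance, in a PyTorch-style workflow (a sketch under the assumption that PyTorch is used; the tiny `torch.nn.Linear` model merely stands in for a large network), the best parameter copy might be kept in host memory or written to disk like this:

```python
import torch

# Toy stand-in for a large network trained on an accelerator (illustrative).
model = torch.nn.Linear(10, 1)

# Keep the best-so-far parameter copy in host (CPU) memory rather than on
# the training device, so it does not consume accelerator memory.
best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}

# Or write it to disk. These writes happen only when the validation error
# improves and are never read during training, so their cost is negligible.
torch.save(best_state, "best_params.pt")
```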

Early stopping is a very unobtrusive form of regularization: it requires almost no change to the underlying training procedure, the objective function, or the set of allowable parameter values, so it is easy to use without damaging the learning dynamics. This contrasts with weight decay, where one must take care not to use too much and trap the network in a bad local minimum corresponding to pathologically small weights. Early stopping can be used alone or in combination with other regularization strategies. Even when regularization strategies modify the objective function to encourage better generalization, it is rare for the best generalization to occur at a local minimum of the training objective.

Early stopping requires a validation set, which means some training data is never fed to the model. To exploit this extra data, we can perform a second round of training after the initial early-stopped run has completed. In this second round, all of the training data is included. There are two basic strategies for this second round (see the code sketch below). The first is to reinitialize the model and retrain on all of the data, training for the optimal number of steps determined by early stopping in the first round. There are some subtleties in this procedure: for example, there is no way to know whether it is better to retrain for the same number of parameter updates or for the same number of passes through the dataset. Because the training set is now larger, each pass through the dataset in the second round corresponds to more parameter updates. The second strategy is to keep the parameters obtained in the first round and continue training on all of the data. At this stage there is no validation set to tell us after how many steps to stop. Instead, we can monitor the average loss function on the validation set and continue training until it falls below the value at which the early stopping procedure halted. This strategy avoids the high cost of retraining the model from scratch, but it does not behave as well. For example, the objective on the validation set may never reach the target value, so this strategy is not even guaranteed to terminate.

Early stopping is also useful for reducing the computational cost of training. Besides the obvious savings from limiting the number of training iterations, it provides the benefit of regularization without requiring a penalty term in the cost function or the computation of the gradients of such a term.
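Here is a hedged sketch of the two second-round strategies on the same toy problem as above. `tau_star` and `val_target` stand for the step count and stopping value recorded by a hypothetical first run; the data and all hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)
X_train, y_train = X[:150], y[:150]   # data used in the first run
X_val, y_val = X[150:], y[150:]       # validation set from the first run
X_all, y_all = X, y                   # all data, used in the second round

lr = 0.01
tau_star = 500      # optimal step count found by the first run (assumed)
val_target = 0.30   # loss value at which the first run halted (assumed)

def grad(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

# Strategy 1: reinitialize and train on ALL data for tau_star steps.
w1 = np.zeros(10)
for _ in range(tau_star):
    w1 -= lr * grad(w1, X_all, y_all)

# Strategy 2: keep the first-run parameters (np.zeros here stands in for
# them) and continue on ALL data until the monitored validation loss falls
# below the target. NOTE: not guaranteed to terminate, hence the step cap.
w2 = np.zeros(10)
for _ in range(100_000):
    if mse(w2, X_val, y_val) < val_target:
        break
    w2 -= lr * grad(w2, X_all, y_all)
```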

So far we have claimed that early stopping is a regularization strategy, but we have supported this claim only by showing that the validation-error learning curve is U-shaped. What is the actual mechanism? Early stopping can be viewed as restricting the optimization procedure to a relatively small neighborhood of the initial parameter value $\theta_0$. More specifically, imagine taking $\tau$ optimization steps (corresponding to $\tau$ training iterations) with learning rate $\epsilon$. We can regard the product $\epsilon\tau$ as a measure of effective capacity. Assuming the gradient is bounded, restricting both the number of iterations and the learning rate limits the volume of parameter space reachable from $\theta_0$, as shown in the figure below. In this sense, $\epsilon\tau$ behaves as if it were the reciprocal of the weight decay coefficient. In fact, for a simple linear model with a quadratic error function and simple gradient descent, one can show that early stopping is equivalent to $L^2$ regularization.
[Figure: schematic diagram of the effect of early stopping]
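To make the claimed equivalence concrete, the standard argument can be sketched as follows. The Hessian $H$, its eigendecomposition $Q\Lambda Q^\top$, the minimizer $\theta^*$, and the weight decay coefficient $\alpha$ are notation introduced here for the derivation, not symbols from the original text.

```latex
% Quadratic approximation of the cost around its minimum \theta^*:
J(\theta) \approx J(\theta^*) + \tfrac{1}{2}(\theta - \theta^*)^\top H\,(\theta - \theta^*)

% Gradient descent from \theta^{(0)} = 0 with learning rate \epsilon gives, after \tau steps,
\theta^{(\tau)} - \theta^* = (I - \epsilon H)^{\tau}\,(\theta^{(0)} - \theta^*)

% In the eigenbasis H = Q \Lambda Q^\top this becomes
Q^\top \theta^{(\tau)} = \left[ I - (I - \epsilon\Lambda)^{\tau} \right] Q^\top \theta^*

% whereas L^2 regularization with coefficient \alpha yields
Q^\top \tilde{\theta} = \left[ I - (\Lambda + \alpha I)^{-1}\alpha \right] Q^\top \theta^*

% The two solutions coincide when (I - \epsilon\Lambda)^{\tau} = (\Lambda + \alpha I)^{-1}\alpha;
% for \epsilon\lambda_i \ll 1 and \lambda_i/\alpha \ll 1 this reduces to
% \tau \approx 1/(\epsilon\alpha), matching the claim that \epsilon\tau behaves
% like the inverse of the weight decay coefficient.
```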

