Improving the performance of any given neural network using ensembles

In this article, we will discuss two recent and interesting papers whose shared idea is to improve the performance of any given neural network through ensembling. The two papers are:

  1. "Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs" proposed by Garipov et al.
  2. "Averaging Weights Leads to Wider Optima and Better Generalization" by Izmailov et al.

Traditional Neural Network Ensemble Methods

The traditional ensemble approach is to combine several different models, have each of them make a prediction on the same input, and then use some form of averaging to determine the ensemble's final prediction.
The averaging can be a simple vote, an arithmetic mean, or even another model that learns to predict the correct output from the predictions of the ensemble members.
Ridge regression is one particular way of combining predictions, and it has been used in winning Kaggle solutions.
Snapshot ensembling: save the model at the end of each learning rate cycle, and then use all of the saved models together to make predictions.

When ensembling is combined with deep learning, the final prediction is produced by combining the predictions of several neural networks. Ensembling networks with different architectures usually works well, because each model tends to make mistakes on different samples, so combining them maximizes the gain from the ensemble.
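As a minimal sketch of this kind of model-space ensembling (the list of trained models `models` and the input batch `x` are placeholders, not anything from the papers):

```python
import torch

def ensemble_predict(models, x):
    """Average the softmax outputs of several trained models and return the class prediction."""
    probs = []
    with torch.no_grad():
        for model in models:
            model.eval()
            probs.append(torch.softmax(model(x), dim=1))
    avg_probs = torch.stack(probs).mean(dim=0)   # simple averaging of the predictions
    return avg_probs.argmax(dim=1)               # final prediction of the ensemble
```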
[Figure: snapshot ensembling uses a cyclic learning rate with an annealing schedule]

However, you can also ensemble neural networks that share the same architecture, and this can work surprisingly well. Building on this idea, the snapshot ensembling paper proposes a clever training trick: while training a single neural network, take snapshots of its weights along the way, and after training combine these snapshots, models with the same architecture but different weights, into an ensemble. Experiments show that the resulting ensemble improves test performance, and the method is very cheap: you only train one model once, which greatly reduces the computational cost.

If you have not used a cyclic learning rate schedule in training, you should learn to use it: it is one of the most effective modern training techniques, it is easy to implement, it adds almost no computational overhead, and it can yield significant gains.
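As a hedged, minimal sketch of a cosine-annealed cyclic learning rate in PyTorch (the tiny model, the optimizer settings, and the 40-epoch cycle length are illustrative placeholders):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(10, 2)                              # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Cosine annealing that restarts (jumps back to the initial lr) every 40 epochs.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=40, eta_min=1e-4)

for epoch in range(120):                                    # three 40-epoch cycles
    # ... one epoch of training with `optimizer` goes here ...
    scheduler.step()                                        # learning rate follows the cyclic schedule
```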

All of the examples introduced so far are model-space ensembling methods: several models are combined, and their individual predictions are aggregated to produce the final result.

In the papers discussed in this article, however, the authors propose a new ensembling approach based on the weight space: the weights of the same network at different stages of training are combined, and the model with these combined weights is then used to make predictions. This approach has two advantages:

  • After training we end up with a single model carrying the combined weights, which greatly speeds up prediction compared with running many ensemble members.
  • Experimental results show that this weight-space approach beats the state-of-the-art snapshot ensembling method.

Below, we will take a closer look at how it works. Before that, we need some background on the loss surface and on what makes a solution generalize well.

Solutions in weight space

The first important insight is that a trained network is simply a point in a high-dimensional weight space. For a given architecture, every distinct combination of weights produces a distinct model, and since there are infinitely many weight combinations, there are infinitely many possible solutions. The goal of training a neural network is to find a particular point in this weight space that gives a low value of the loss function on both the training and the test data.

During training, the learning algorithm changes the weights and thereby explores solutions in the weight space. Gradient descent travels over a loss surface whose elevation at each point is given by the value of the loss function.
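To make the "point in weight space" picture concrete, here is a small sketch (the two-layer network is an arbitrary placeholder) that flattens all of a model's parameters into a single vector with PyTorch's `parameters_to_vector`:

```python
import torch
from torch.nn.utils import parameters_to_vector

model = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2)
)

# Every setting of the weights corresponds to one point in R^d,
# where d is the total number of parameters.
w = parameters_to_vector(model.parameters())
print(w.shape)   # torch.Size([418]): this model lives in a 418-dimensional weight space
```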

Local optima and global optima

Visualizing and understanding the geometry of a high-dimensional weight space is very difficult, but it is also very important: during training, stochastic gradient descent essentially travels over the loss surface in this high-dimensional space, looking for a good solution, that is, a point on the loss surface where the loss is low. It is well known that such a surface has many local optima, but not all of them are global optima.

Hinton once put it this way: "To deal with hyperplanes in a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly. Everyone does it."

[Figure: a local (narrow) optimum versus a global (flat) optimum. A minimum in a flat region produces similar loss during training and testing, while a narrow minimum produces very different losses during training and testing. In other words, flat ("global") minima generalize better than narrow, local ones.]

One metric that can distinguish a good solution from a bad one is its flatness. The model produces similar, but not identical, loss surfaces on the training set and the test set; you can imagine the test loss surface as slightly shifted relative to the training one. For a narrow local optimum, a point with low training loss can end up with a large test loss because of this shift, which means the solution does not generalize well: the loss is low during training but high during testing. For a flat, global optimum, on the other hand, the same shift keeps the training and test losses close to each other.

I explained the difference between local and global optima above because the new method this article focuses on is designed to find the latter.

Snapshot Ensembling

At the beginning of training, SGD makes large jumps in the weight space. Then, as cosine annealing gradually lowers the learning rate, SGD converges to a local optimum, and snapshot ensembling adds a snapshot of the model at that point to the ensemble. The learning rate is then reset to a large value, and SGD again makes a large jump before converging to a different local optimum.

Snapshot ensembling uses cycles that are 20 to 40 epochs long. Long learning rate cycles help find models in the weight space that are as dissimilar as possible: if the models are too similar, the predictions of the individual networks in the ensemble will be too close and the benefit of ensembling becomes negligible.
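A minimal sketch of this kind of snapshot loop, reusing the cosine-restart scheduler from the earlier example (the model, cycle length, and number of cycles are placeholders):

```python
import copy
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(10, 2)                          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

cycle_len = 40                                          # epochs per learning rate cycle
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=cycle_len)

snapshots = []
for epoch in range(1, 4 * cycle_len + 1):               # four cycles -> four snapshots
    # ... one epoch of training goes here ...
    scheduler.step()
    if epoch % cycle_len == 0:                          # end of a cycle: learning rate is at its minimum
        snapshots.append(copy.deepcopy(model.state_dict()))

# At prediction time, load each snapshot into a copy of the model
# and average the predictions (as in `ensemble_predict` above).
```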

Snapshot ensembling works very well and can substantially improve model performance, but Fast Geometric Ensembling works even better.

Fast Geometric Ensembling (FGE)

Fast Geometric Ensembling (FGE) is very similar to snapshot ensembling, but with some distinguishing features. First, it uses a piecewise-linear cyclic learning rate instead of the cosine schedule used in snapshot ensembling. Second, its cycles are much shorter, only 2 to 4 epochs per cycle.

Intuitively, such short cycles might seem wrong: the models at the end of neighboring cycles will be close to each other, so combining them should bring little benefit. However, as the authors discovered, there exist low-loss paths connecting quite different models, and it is possible to travel along these paths in small steps and ensemble the models encountered along the way with good results. Compared with snapshot ensembling, FGE therefore reaches the models we want in fewer steps, which also makes training faster.
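A hedged sketch of the kind of piecewise-linear (triangular) cyclic learning rate FGE relies on; the cycle length and learning rate bounds here are illustrative defaults, not the values from the paper:

```python
def fge_learning_rate(iteration, cycle_len=4, lr_max=0.05, lr_min=0.005):
    """Piecewise-linear cyclic schedule: ramp down from lr_max to lr_min over the
    first half of each cycle, then ramp back up over the second half."""
    t = (iteration % cycle_len) / cycle_len            # position within the current cycle, in [0, 1)
    if t < 0.5:
        return lr_max - (lr_max - lr_min) * (2 * t)    # linear decrease
    return lr_min + (lr_max - lr_min) * (2 * t - 1)    # linear increase

# A model snapshot would typically be collected near the middle of each cycle,
# when the learning rate reaches lr_min.
```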
[Figure] Left: traditional intuition says that local minima are separated by regions of high loss, and that is indeed what you see if you move between two minima along a straight line. Middle and right: in fact, there exist paths between local minima along which the loss stays low. FGE builds its ensemble from models taken along these paths.

To take advantage of snapshot ensembling or FGE, we need to store multiple trained models, run each of them at prediction time, and average their predictions. Better ensemble performance therefore comes at a higher computational cost, an instance of the "no free lunch" principle, and this is exactly the motivation behind the stochastic weight averaging paper.

Stochastic Weight Averaging (SWA)

Stochastic Weight Averaging (SWA) closely approximates FGE, but at only a fraction of its computational cost. It can be applied to any architecture and any dataset and shows good results across them. The paper shows that SWA tends toward the flat, global optima discussed above, with the advantages described earlier. SWA is not an ensemble method in the traditional sense: at the end of training you get a single model, yet its performance beats both snapshot ensembling and FGE.
[Figure] Left: W1, W2 and W3 are three independently trained networks, and Wswa is their average in weight space. Middle: Wswa achieves better test-set performance than SGD. Right: note that although Wswa shows a slightly worse loss during training, it generalizes better.

The insight behind SWA comes from an empirical observation: the local minima reached at the end of each learning rate cycle tend to accumulate on the border of a low-loss region of the loss surface (points W1, W2 and W3 in the left panel above lie on the border of the red low-loss region). By averaging several such points in weight space, we can obtain a solution with lower loss, better generalization, and wider applicability (Wswa in the left panel above).

Here is how SWA works. Instead of an ensemble of many models, you only need two models:

  • The first model stores the running average of the model weights (w_swa in the formula below). After training, this is the final model used for prediction.
  • The second model (w in the formula below) traverses the weight space, exploring it with a cyclic learning rate.

The weight update rule for stochastic weight averaging is

    w_swa ← (w_swa · n_models + w) / (n_models + 1)

At the end of each learning rate cycle, the running average w_swa is updated by taking a weighted mean of the old average and the second model's current weights w, where n_models is the number of weight sets averaged so far. With this approach, you only train one model and keep just two models in memory during training. At prediction time, you only need the model with the averaged weights, and running that single model is much faster than running the many models required by the ensemble methods above.
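A minimal sketch of this running-average update in PyTorch; the model, cycle length, and training loop are placeholders, and the update mirrors the formula above rather than the authors' exact code:

```python
import copy
import torch

model = torch.nn.Linear(10, 2)                   # the second model: explores the weight space
swa_model = copy.deepcopy(model)                 # the first model: stores the running average
n_models = 0

def update_swa(swa_model, model, n_models):
    """w_swa <- (w_swa * n_models + w) / (n_models + 1), applied parameter by parameter."""
    with torch.no_grad():
        for p_swa, p in zip(swa_model.parameters(), model.parameters()):
            p_swa.mul_(n_models).add_(p).div_(n_models + 1)
    return n_models + 1

cycle_len = 4
for epoch in range(1, 41):
    # ... one epoch of training `model` with a cyclic learning rate goes here ...
    if epoch % cycle_len == 0:                   # end of a learning rate cycle
        n_models = update_swa(swa_model, model, n_models)

# `swa_model` is the single averaged model used at prediction time.
# (For networks with batch normalization, the BN statistics of `swa_model`
# must be recomputed with a pass over the training data afterwards.)
```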

Epilogue

The authors have open-sourced a PyTorch implementation of the paper. There is also an implementation of SWA in the excellent fast.ai library that anyone can use.
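In addition, recent versions of PyTorch ship SWA utilities in `torch.optim.swa_utils`; here is a minimal, hedged sketch of how they fit together (the model, dummy data, and schedule values below are placeholders):

```python
import torch
import torch.nn.functional as F
from torch.optim.swa_utils import AveragedModel, SWALR

model = torch.nn.Linear(10, 2)                           # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

swa_model = AveragedModel(model)                         # keeps the running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05)            # SWA learning rate schedule
swa_start = 30                                           # start averaging after this epoch

x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))   # dummy batch standing in for real data
for epoch in range(1, 51):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)               # fold the current weights into the average
        swa_scheduler.step()

# For real datasets with batch norm, finish with:
# torch.optim.swa_utils.update_bn(train_loader, swa_model)
```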

