ICLR 2017: "Temporal Ensembling for Semi-Supervised Learning"

First acquaintance

Multi-model ensembles usually produce better outputs than a single network. The same idea appears inside single-network training, in regularization methods such as dropout and dropconnect: only a specific sub-network is adjusted at each training step, so at test time the full network can be viewed as an implicit ensemble of these trained sub-networks.

This paper applies the ensembling idea to semi-supervised learning (only part of the training data is labeled), using the ensembled output as a pseudo-label, which is closer to the true label than the output of a single forward pass. The idea is implemented in two forms, the π model and temporal ensembling, which improve not only semi-supervised image classification but also fully supervised classification.

Method

π model

The core idea of the π model is to apply two different augmentations to the same image, feed both copies through the network (dropout is used, so each pass effectively goes through a different sub-network), and make the two outputs as consistent as possible. Training works as follows: x_i denotes an input image and y_i its ground-truth label (only some of the x_i have labels). Labeled images contribute a standard cross-entropy loss; the network outputs for the two augmented copies contribute a mean-squared-error consistency loss, which is weighted by a time-dependent function w(t).
[Figure: π model structure and training procedure]
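The combined loss described above can be sketched in plain numpy. This is a minimal illustration, not the paper's implementation; the function and argument names (`pi_model_loss`, `z1`, `z2`) are my own, and unlabeled samples are marked with label -1 for convenience.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pi_model_loss(z1, z2, labels, w_t):
    """Sketch of the pi-model loss for one batch.

    z1, z2 : network logits for two augmented copies of the same images,
             shape (batch, num_classes)
    labels : int class index per sample, or -1 for unlabeled samples
    w_t    : unsupervised weight w(t) at the current epoch
    """
    p1, p2 = softmax(z1), softmax(z2)
    labeled = labels >= 0
    # Supervised term: cross-entropy, computed on the labeled subset only.
    ce = 0.0
    if labeled.any():
        ce = -np.mean(np.log(p1[labeled, labels[labeled]] + 1e-12))
    # Unsupervised term: mean squared difference between the two predictions.
    mse = np.mean((p1 - p2) ** 2)
    return ce + w_t * mse
```

When the two augmented passes agree exactly, the consistency term vanishes and only the supervised cross-entropy remains.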
The setting of w(t) is very important: it is a ramp-up function that starts from zero and rises slowly, which means early training is dominated by the supervised loss (otherwise the network quickly falls into a degenerate solution that produces meaningless classifications).

For specific setting details, please refer to Appendix A of the original text.
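As a concrete illustration of the ramp-up, the paper's Appendix A uses a Gaussian curve exp(-5(1 - T)^2), with T rising linearly from 0 to 1 over the first 80 epochs. A small sketch (the function name `rampup` and the per-epoch granularity are my own choices):

```python
import numpy as np

def rampup(epoch, ramp_length=80):
    """Gaussian ramp-up from Appendix A of the paper: the weight scales as
    exp(-5 * (1 - T)^2), where T goes linearly from 0 to 1 over the first
    `ramp_length` epochs and stays at 1 afterwards."""
    if epoch >= ramp_length:
        return 1.0
    t = max(0.0, float(epoch)) / ramp_length
    return float(np.exp(-5.0 * (1.0 - t) ** 2))
```

At epoch 0 this is exp(-5), close to zero, so the consistency loss barely contributes at the start, exactly as the text above requires.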

Temporal ensembling

Building on the π model, the authors further propose temporal ensembling; its network structure is shown below. Each sample is augmented and sent through the network only once per epoch, unlike the π model, which requires two passes, so in theory it is twice as efficient.
[Figure: temporal ensembling structure]
Why is a single pass enough? Because ẑ_i is already the result of ensembling, implemented with an exponential moving average (EMA). After each training epoch, Z is first updated with the formula below; Z thus accumulates a weighted ensemble of the network's outputs from previous epochs, with α as the momentum weight. Then, to generate the target ẑ, Z is divided by the factor 1 − α^t to correct the start-up bias (the same bias correction as in Adam).
Z ← αZ + (1 − α)z
ẑ ← Z / (1 − α^t)
As before, w(t) is a time-dependent ramp-up function starting from zero.
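The EMA update and bias correction can be sketched directly from the two formulas above (the function name `temporal_ensemble_update` is my own; `t` counts completed epochs starting from 1):

```python
import numpy as np

def temporal_ensemble_update(Z, z_epoch, alpha, t):
    """One temporal-ensembling update at the end of epoch t (t >= 1).

    Z       : running (uncorrected) ensemble of past predictions
    z_epoch : this epoch's network predictions for all samples
    alpha   : EMA momentum weight
    Returns the new Z and the bias-corrected target z_hat = Z / (1 - alpha**t).
    """
    Z = alpha * Z + (1.0 - alpha) * z_epoch
    z_hat = Z / (1.0 - alpha ** t)
    return Z, z_hat
```

The bias correction matters early on: with Z initialized to zero, the first corrected target equals the first epoch's predictions exactly instead of being scaled down by (1 − α).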

Compared with the π model, temporal ensembling trains faster and gets better results, but its drawback is that it needs extra memory to store the per-sample ensemble predictions and introduces a new hyperparameter α.

Selected experiments

The two tables below compare the proposed models with SOTA methods under the semi-supervised and fully supervised settings respectively, on the CIFAR-10 and SVHN datasets. The proposed training paradigm brings considerable gains in both settings.
[Tables: comparison with SOTA methods on CIFAR-10 and SVHN]

Review

This paper was published by NVIDIA at ICLR 2017. It proposes a solution to semi-supervised image classification whose idea is very simple and can be summed up as a consistency constraint: perturb or jitter the input before feeding it into the network, and use a loss to make the resulting outputs consistent. This idea shows up in work across many fields. There is also the EMA trick; looking back now, it is a real classic. Many current self-supervised learning methods use the same operation (such as MoCo; I will write an interpretation of the MoCo series when I have time) to carry out model ensembling in an implicit, stable way.

Overall, I like this article very much; it is a typical representative of simple and effective work (I have read few papers in the semi-supervised field, so this is only my personal opinion).

You can also refer to this blog (https://www.cnblogs.com/wuliytTaotao/p/12825797.html). It is quite good, and it also covers the extension to Mean Teacher: the EMA update moves from the end of each epoch to the end of each training step, and the thing being averaged changes from predictions to model weights. There are thus two models: the one whose weights are updated in real time is the Student, and the EMA of the Student's weights is the Teacher; the Teacher's output is used to enforce the consistency constraint.
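The Mean Teacher weight update described above reduces to one line of EMA over parameters. A minimal sketch, assuming the parameters are held as a list of arrays (the function name `mean_teacher_update` is my own):

```python
import numpy as np

def mean_teacher_update(teacher_params, student_params, alpha=0.99):
    """After each training step, move the teacher's weights toward the
    student's by exponential moving average (the Mean Teacher update)."""
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_params, student_params)]
```

With α close to 1 the teacher changes slowly and smooths out the noise in the student's step-to-step updates, which is what makes its outputs a stable consistency target.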

Origin blog.csdn.net/qq_36560894/article/details/123161982