Recommended reading:
Pseudo-Label, a simple and effective semi-supervised approach: pseudo-labeling in deep learning simply picks the class with the largest predicted probability and uses it as if it were the true label.
Entropy regularization and entropy minimization: Entropy Minimization & Regularization
Official GitHub source code: google-research/mixmatch
Summary
MixMatch unifies the current mainstream approaches to semi-supervised learning into a single new algorithm: it guesses low-entropy labels for data-augmented unlabeled examples and mixes labeled and unlabeled data using MixUp.
1. Introduction
Many semi-supervised learning methods add a loss term computed on unlabeled data, and that loss term typically falls into one of three categories:
(1) Entropy minimization: encourages the model to output confident predictions on unlabeled data.
(2) Consistency regularization: encourages the model to produce the same output distribution when its input is perturbed.
(3) Generic regularization: encourages the model to generalize well and avoid overfitting the training data.
MixMatch is an SSL algorithm that elegantly unifies these mainstream semi-supervised methods under a single loss.
Figure 1: Schematic of the label guessing process used in MixMatch. Random data augmentation is applied K times to an unlabeled image, and each augmented image is fed to a classifier. The mean of these K predictions is then "sharpened" by adjusting the temperature of the distribution.
2. Related work
2.1 Consistency Regularization
A common regularization technique in supervised learning is data augmentation, which applies input transformations assumed not to affect class semantics. For example, in image classification it is common to elastically deform the input image or add noise to it, which can greatly change the pixel content of an image without changing its label.
Consistency regularization applies data augmentation to unlabeled data: the augmented samples are fed to the classifier, and the predictions are required to be consistent. That is, for samples generated from the same input by random augmentation, the model's predictions should agree, and this constraint is added to the loss function. Note that Augment(x) is a stochastic transformation, so the two terms in equation (1) are not identical.
MixMatch exploits a self-consistent form of regularization by applying standard data augmentation (random horizontal flipping and cropping) to images.
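The consistency term described above can be sketched as a squared-difference penalty between predictions on two independent augmentations of the same input. This is a minimal illustration, not the paper's code; `model` and `augment` are placeholder names.

```python
import numpy as np

def consistency_loss(model, x, augment, rng):
    """L2 consistency penalty between predictions on two random
    augmentations of the same batch x. Because `augment` is
    stochastic, the two forward passes see different inputs."""
    p1 = model(augment(x, rng))  # class probabilities, shape (N, C)
    p2 = model(augment(x, rng))
    return np.mean(np.sum((p1 - p2) ** 2, axis=-1))
```

With a well-trained model and label-preserving augmentations, this loss should be close to zero; minimizing it pushes the model toward augmentation-invariant predictions.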
2.2 Entropy Minimization
A common underlying assumption of many semi-supervised learning methods is that the classifier's decision boundary should not pass through high-density regions of the marginal data distribution. One way to enforce this is to require the classifier to output low-entropy predictions on unlabeled data, i.e., to add a loss term that minimizes the entropy of the model's predictions on unlabeled data.
"Pseudo-Label" achieves entropy minimization implicitly by constructing hard (one-hot) labels from high-confidence predictions on unlabeled data and using them as targets in a standard cross-entropy loss. MixMatch also implicitly minimizes entropy, by applying a "sharpening" function to the target distribution of unlabeled data.
2.3 Traditional Regularization
We use weight decay, which penalizes the L2 norm of the model parameters. MixMatch also uses MixUp, both as a regularizer (applied to labeled data points) and as a semi-supervised learning method (applied to unlabeled data points). MixUp has been applied to semi-supervised learning before.
3. MixMatch
X is the labeled data, U is the unlabeled data, X_hat is the augmented labeled data, and U_hat is the augmented unlabeled data.
3.1 Data augmentation
Data augmentation is applied to both labeled and unlabeled data. For each x_b in a batch X of labeled data, we generate one transformed version x̂_b = Augment(x_b). For each u_b in a batch of unlabeled data U, we generate K augmented versions û_{b,k} = Augment(u_b), k ∈ {1, …, K}.
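This augmentation step can be sketched as follows. `augment` stands in for any stochastic, label-preserving transform (e.g. random flip and crop); the function names are placeholders, not the official implementation.

```python
import numpy as np

def augment_batch(x_labeled, u_unlabeled, augment, K=2, rng=None):
    """One augmentation per labeled example, K per unlabeled example.

    Returns x_hat (list of augmented labeled inputs) and
    u_hat (list of K-element lists of augmented unlabeled inputs)."""
    x_hat = [augment(x, rng) for x in x_labeled]
    u_hat = [[augment(u, rng) for _ in range(K)] for u in u_unlabeled]
    return x_hat, u_hat
```

The paper uses a small K (K = 2 in the experiments), since each extra augmentation costs one extra forward pass per unlabeled example.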
3.2 Label Guessing
For each unlabeled example in U, MixMatch uses the model's predictions to generate a "guess" for that example's label. This guess is later used in the unsupervised loss term. To do this, we compute the mean of the class distributions predicted by the model over all K augmentations of u_b.
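The averaging step can be sketched directly from that description; `model` is a placeholder for a classifier returning a class distribution.

```python
import numpy as np

def guess_label(model, u_augs):
    """Average the model's predicted class distributions over the
    K augmented copies of one unlabeled example."""
    preds = np.stack([model(u) for u in u_augs])  # shape (K, C)
    return preds.mean(axis=0)
```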
Sharpening: lowering the temperature of the averaged distribution reduces its entropy, making the prediction more confident. In other words, classes with higher probability are pushed higher, and classes with lower probability are pushed lower.
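The sharpening operation is the temperature adjustment from the paper, Sharpen(p, T)_i = p_i^(1/T) / Σ_j p_j^(1/T):

```python
import numpy as np

def sharpen(p, T=0.5):
    """Temperature sharpening: raise each probability to 1/T and
    renormalize. As T -> 0 the output approaches a one-hot
    distribution; T = 1 leaves p unchanged."""
    p = p ** (1.0 / T)
    return p / p.sum(axis=-1, keepdims=True)
```

MixMatch applies this to the averaged guess from the previous step, which is how it implicitly performs entropy minimization.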
Pseudocode: read in conjunction with Figure 1
3.3 MixUp
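MixUp interpolates pairs of examples and their labels. MixMatch uses a slightly modified version: after sampling λ ~ Beta(α, α), it takes λ' = max(λ, 1 − λ) so that the mixed example stays closer to the first input. A minimal sketch:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.75, rng=np.random):
    """MixUp as modified in MixMatch: sample lambda from
    Beta(alpha, alpha), then force lambda >= 0.5 so the result
    remains closer to (x1, y1) than to (x2, y2)."""
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y
```

Keeping λ ≥ 0.5 matters because MixMatch needs each mixed example to retain the identity (labeled vs. unlabeled) of its first argument when computing the separate supervised and unsupervised loss terms.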
4. Experiments
Comparison of different methods:
Ablation study: contribution of each component:
Comparison of semi-supervised algorithm results:
5. Conclusion
Summary of MixMatch's key innovations:
(1) MixMatch integrates consistency regularization, applying random horizontal flips and crops during data augmentation.
(2) MixMatch uses a sharpening function to minimize the entropy of predictions on unlabeled data.
(3) MixMatch uses Adam as the optimizer and applies L2 weight decay.
(4) MixMatch uses MixUp as a data augmentation strategy.