Understanding Semi-Supervised Learning Models in One Article (Semi-Supervised Learning)

1. The overall framework of semi-supervised learning

[Figure: overall framework of semi-supervised learning methods]

2. Consistency regularization model

The goal of these algorithms is that, for the same unlabeled image, the model's predictions should remain consistent before and after extra noise is added to the image.

Noise can be added through image augmentation, for example spatial augmentations (flips, translations, crops) or pixel-level augmentations (color jitter, added pixel noise).

Likewise, dropout can introduce noise into the model structure itself.


1. Π model

Paper link: TEMPORAL ENSEMBLING FOR SEMI-SUPERVISED LEARNING

  1. Apply two different data augmentation transformations (e.g. of different strengths) to the same image, mainly to inject noise.
  2. Feed the two differently augmented images, one after the other, into the same network with dropout (two forward passes).
  3. Measure the consistency between the two output distributions (for example with KL divergence); if the image has a label, also compute the cross-entropy loss against the true label.
  4. The consistency loss and the cross-entropy loss are combined as a weighted sum to form the final loss, which is then backpropagated (a minimal sketch follows this list).
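
Below is a minimal PyTorch-style sketch of this loss. The `model`, `augment` function, and weight `w` are placeholders; mean squared error between softmax outputs is used as the consistency measure (the choice in the original paper), though the KL divergence mentioned above would also work:

```python
import torch
import torch.nn.functional as F

def pi_model_loss(model, x, y=None, augment=lambda t: t, w=1.0):
    """Pi-model sketch: consistency between two stochastic forward passes,
    plus cross-entropy when a label is available."""
    logits_1 = model(augment(x))   # first augmented view (dropout active in train mode)
    logits_2 = model(augment(x))   # second augmented view
    consistency = F.mse_loss(F.softmax(logits_1, dim=-1),
                             F.softmax(logits_2, dim=-1))
    supervised = F.cross_entropy(logits_1, y) if y is not None else 0.0
    return supervised + w * consistency
```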


Model disadvantages: each input requires two forward passes to compute the consistency loss, which can be inefficient.

2. Temporal Ensembling model

Link to the paper: TEMPORAL ENSEMBLING FOR SEMI-SUPERVISED LEARNING
This algorithm maintains a target value for comparison during training; at each step the model's output is compared against this target for distribution consistency.

  1. Add noise to an image.
  2. Feed the noised image into the model (only one forward pass).
  3. Measure the distribution consistency between the model's output and the stored target value. If the image has a label, the cross-entropy loss against the label is also computed.
  4. The consistency loss and the cross-entropy loss are combined as a weighted sum to form the final loss, which is then backpropagated.

Note: the target value is not fixed at each forward pass; it is computed with an exponential moving average (EMA) over the model's outputs from previous epochs. The EMA update is as follows.

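A standard form of this update, following the Temporal Ensembling paper, is given below, where z_i is the network's prediction for sample i in the current epoch, Z_i is the accumulated ensemble prediction, α is the EMA momentum, and t is the epoch index; the division by 1 − α^t corrects the startup bias of the moving average:

$$Z_i \leftarrow \alpha Z_i + (1-\alpha)\, z_i, \qquad \tilde{z}_i = \frac{Z_i}{1-\alpha^{t}}$$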

Model disadvantages:

  • Since each target is only updated once per epoch, the learned information is incorporated into the training process at a slower rate.
  • The larger the dataset, the longer the interval between target updates, and in an online-learning setting it is unclear how to apply temporal ensembling.

3. Mean teacher model

Link to the paper: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
In this algorithm the teacher model is separate from the student model, and its weights are not learned by a gradient update at each step; instead, an average of the student model's weights accumulated over successive training steps is used as the teacher's weights. In other words, the teacher uses an exponential moving average (EMA) of the student's weights rather than sharing weights with the student. The paper states that averaging model weights over training steps tends to produce a more accurate model than using the final weights directly. The teacher can aggregate the information learned so far immediately after every step, not just once per epoch. Furthermore, since the EMA improves the outputs of all layers rather than only the last layer, the model can better represent mid-level and even high-level semantic information. These aspects give the model two advantages over Temporal Ensembling: first, more accurate targets enable a faster feedback loop between the student and teacher models, resulting in better accuracy; second, the method scales to large datasets and online learning.

  1. Two different kinds of noise are applied to the same input.
  2. The two differently perturbed images are fed to the student model and the teacher model, respectively.
  3. Apply softmax to the outputs and compute a consistency loss between the student's prediction and the teacher's prediction. If the image has a label, the cross-entropy loss against the label is also computed.
  4. Update the student model's weights with gradient descent, and update the teacher model's weights as an exponential moving average (EMA) of the student weights (part of the weight comes from the previous teacher weights, the rest from the current-step student weights):
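
With θ'_t denoting the teacher weights at training step t, θ_t the student weights, and α the EMA decay coefficient, the standard form of this update is:

$$\theta'_t = \alpha\,\theta'_{t-1} + (1-\alpha)\,\theta_t$$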

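As an illustration, here is a minimal PyTorch-style sketch of the teacher EMA update; the function name and the value of alpha are only examples:

```python
import torch

@torch.no_grad()
def update_teacher(teacher: torch.nn.Module, student: torch.nn.Module, alpha: float = 0.99):
    """Parameter-wise EMA: teacher = alpha * teacher + (1 - alpha) * student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)
```

Calling update_teacher(teacher, student) once after every optimizer step on the student keeps the teacher a smoothed copy of the student.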

Model disadvantages:

  • The choice of the teacher model is crucial to the model performance. Choosing an inappropriate teacher model may lead to poor performance. Therefore, some experiments are needed to determine the best teacher model.
  • The Mean Teacher model still needs some labeled data for supervised training, especially in the early stages. This means it is not suitable for fully unsupervised situations.
  • Mean Teacher assumes that labeled and unlabeled data are generated from the same data distribution. If this assumption does not hold, model performance may suffer.
  • The Mean Teacher model is sensitive to noise and outliers in the data. This can cause the model to perform poorly in the presence of outlier data.

4. UDA model

Paper link: Unsupervised Data Augmentation for Consistency Training
UDA (Unsupervised Data Augmentation) is a semi-supervised learning algorithm proposed by Google in 2019. It changes the data augmentation strategy of the earlier algorithms: augmentation is applied not to the labeled data but to the unlabeled data, and different augmentation methods are used for different data types. For example, image classification tasks use RandAugment (randomized automatic augmentation), while text classification tasks use back-translation (translating the text into another language and then back into English) and TF-IDF-based word replacement.

  1. Feed the labeled images into the model to obtain predictions, and compute the cross-entropy loss against the labels.
  2. Apply two different augmentations to the unlabeled image, feed them into the model one after the other, and compute the KL-divergence loss between the two predicted distributions.
  3. The KL-divergence loss and the cross-entropy loss are added as a weighted sum and then backpropagated (a minimal sketch follows this list).
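
A minimal PyTorch-style sketch of this combined loss; `model`, `augment`, and the weight `lambda_u` are placeholders for the example, and the first augmented view's distribution is treated as the fixed target:

```python
import torch
import torch.nn.functional as F

def uda_loss(model, x_labeled, y_labeled, x_unlabeled, augment, lambda_u=1.0):
    # Supervised part: cross-entropy on the (non-augmented) labeled data.
    sup_loss = F.cross_entropy(model(x_labeled), y_labeled)

    # Unsupervised part: KL divergence between predictions on two augmented views.
    with torch.no_grad():
        target = F.softmax(model(augment(x_unlabeled)), dim=-1)    # first view, no gradient
    log_pred = F.log_softmax(model(augment(x_unlabeled)), dim=-1)  # second view
    consistency = F.kl_div(log_pred, target, reduction="batchmean")

    return sup_loss + lambda_u * consistency
```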


Model disadvantages:

  • UDA usually requires a large amount of unlabeled data for data augmentation and self-supervised learning training. Without sufficient unlabeled data, the model may not fully benefit from UDA, which limits its applicability.
  • The performance of UDA is highly dependent on the design of self-supervised learning tasks. Choosing an inappropriate self-supervised task may lead to performance degradation. Therefore, self-supervised tasks and corresponding data augmentation strategies need to be carefully selected.
  • The performance of UDA may vary greatly across different domains and tasks. A UDA model that performs well in one domain may not work well in another. This means that UDA may need to be adjusted and retrained on different tasks and domains.
  • UDA focuses on improving the performance of unlabeled data, but in practical applications, sometimes the quality and quantity of labeled data are still critical factors. If labeled data is insufficient or inaccurate, UDA may not adequately compensate for these issues.

3. Pseudo-label model

The idea of this family of algorithms is to first train the model on labeled data, then use the model to predict the unlabeled data and treat high-confidence predictions as pseudo-labels that participate in further training and tuning. The model is thus trained on both real-labeled and pseudo-labeled data.


1. Self-training

Link to the paper: Effective Self-Training for Parsing

The main idea of self-training is to train a model on the existing labeled data, use it to predict the unlabeled data, and add the high-confidence predictions to the training set as pseudo-labels.

  1. Use labeled data to train a task model.
  2. Use this model to make predictions on unlabeled data.
  3. Select predictions whose confidence exceeds a set threshold and use them as pseudo-labels for the corresponding unlabeled samples.
  4. Add these pseudo-labeled pairs to the original training set and remove them from the unlabeled set.
  5. Repeat the above steps until no prediction exceeds the threshold and the datasets no longer change (a minimal sketch follows this list).
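
A minimal sketch of this loop, using an arbitrary scikit-learn classifier; the classifier choice, confidence threshold, and round limit are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(x_labeled, y_labeled, x_unlabeled, threshold=0.95, max_rounds=10):
    """Iteratively promote high-confidence predictions to pseudo-labels."""
    model = LogisticRegression(max_iter=1000)
    for _ in range(max_rounds):
        model.fit(x_labeled, y_labeled)
        if len(x_unlabeled) == 0:
            break
        probs = model.predict_proba(x_unlabeled)
        confident = probs.max(axis=1) >= threshold
        if not confident.any():                      # no selectable pseudo-samples left
            break
        pseudo = model.classes_[probs[confident].argmax(axis=1)]
        x_labeled = np.vstack([x_labeled, x_unlabeled[confident]])
        y_labeled = np.concatenate([y_labeled, pseudo])
        x_unlabeled = x_unlabeled[~confident]
    return model
```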


2. Co-training

Link to the paper: Combining labeled and unlabeled data with co-training

Co-training is a multi-view semi-supervised algorithm. The main idea is to have two models label data for each other: samples that the other model predicts with high confidence, but that the model itself predicts with low confidence, become that model's pseudo-samples.

  1. Divide the labeled data into two parts, and train two models separately.
  2. Use both models to make predictions on the same unlabeled data.
  3. If model 1's prediction confidence exceeds the set threshold while model 2's falls below it, take model 1's prediction as a pseudo-label, add the sample to model 2's training set, and remove it from the unlabeled set; and vice versa. Continue until the datasets no longer change, i.e. neither model produces predictions above the threshold.
  4. Repeat the above steps (a simplified sketch follows this list).
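
A simplified sketch of this procedure on a single feature set; the classifier types, threshold, and round limit are illustrative, and the original algorithm uses two distinct feature views:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

def co_training(x1, y1, x2, y2, x_unlabeled, threshold=0.9, max_rounds=10):
    """Two models pseudo-label for each other when one is confident and the other is not."""
    m1, m2 = DecisionTreeClassifier(), GaussianNB()
    for _ in range(max_rounds):
        m1.fit(x1, y1)
        m2.fit(x2, y2)
        if len(x_unlabeled) == 0:
            break
        p1, p2 = m1.predict_proba(x_unlabeled), m2.predict_proba(x_unlabeled)
        conf1, conf2 = p1.max(axis=1), p2.max(axis=1)
        to_m2 = (conf1 >= threshold) & (conf2 < threshold)   # model 1 teaches model 2
        to_m1 = (conf2 >= threshold) & (conf1 < threshold)   # model 2 teaches model 1
        if not (to_m1.any() or to_m2.any()):
            break
        x2 = np.vstack([x2, x_unlabeled[to_m2]])
        y2 = np.concatenate([y2, m1.classes_[p1[to_m2].argmax(axis=1)]])
        x1 = np.vstack([x1, x_unlabeled[to_m1]])
        y1 = np.concatenate([y1, m2.classes_[p2[to_m1].argmax(axis=1)]])
        x_unlabeled = x_unlabeled[~(to_m1 | to_m2)]
    return m1, m2
```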


3. Tri-training

Link to the paper: Tri-training: exploiting unlabeled data using three classifiers

Tri-training is a further adjustment to co-training, which is also a divergence-based multi-view method.

  1. Use bootstrap sampling (random sampling with replacement) to draw three sub-datasets from the labeled dataset, and train three different base models 1, 2, and 3 on them.
  2. For model 1, use models 2 and 3 to predict all unlabeled data. Whenever models 2 and 3 agree, their prediction is used as a pseudo-label for model 1: the sample is added to model 1's training set and removed from the unlabeled set. This continues until models 2 and 3 disagree on all remaining unlabeled data and the datasets stop changing.
  3. Perform step 2 for each of the three models and retrain the models on the three augmented datasets.
  4. Repeat steps 2 and 3 until the models converge (a simplified sketch follows the note below).

Note: pseudo-labeling unlabeled data can mislabel some samples, i.e. it adds noise to the dataset. When enough new data is added, however, the effect of this noise can be offset.
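
A simplified sketch of the agreement rule; the classifier type and number of rounds are illustrative, and for brevity the agreeing samples are re-selected from the full unlabeled pool each round instead of being deleted from it:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def tri_training(x_labeled, y_labeled, x_unlabeled, rounds=5):
    """Three bootstrap-trained models; the other two models pseudo-label for each model."""
    models = [DecisionTreeClassifier(random_state=i) for i in range(3)]
    boot_sets = []
    for m in models:
        xb, yb = resample(x_labeled, y_labeled)        # bootstrap: sampling with replacement
        boot_sets.append((xb, yb))
        m.fit(xb, yb)
    for _ in range(rounds):
        augmented = []
        for i in range(3):
            j, k = [idx for idx in range(3) if idx != i]
            pj, pk = models[j].predict(x_unlabeled), models[k].predict(x_unlabeled)
            agree = pj == pk                            # the other two models agree
            xi = np.vstack([boot_sets[i][0], x_unlabeled[agree]])
            yi = np.concatenate([boot_sets[i][1], pj[agree]])
            augmented.append((xi, yi))
        for m, (xi, yi) in zip(models, augmented):      # retrain on the augmented sets
            m.fit(xi, yi)
    return models
```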


4. Curriculum Labeling

Link to the paper: Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning

The main idea of Curriculum Labeling is that the model starts by learning from easy samples and gradually progresses to more complex samples and knowledge.

The confidences of the pseudo-labels predicted by the model are assumed to follow a Pareto distribution. Instead of a fixed confidence threshold, the algorithm sorts all pseudo-labels by confidence in each iteration round and selects them by percentile. The percentile grows as the rounds progress, from 20% up to 100%; once all pseudo-labels have been brought into training, training stops.
Rather than fine-tuning the model in each training round, the algorithm re-initializes the parameters (a re-start) at every round, which also helps prevent wrong labels accumulated early in training from misleading the training and causing concept drift (a small sketch of the percentile selection follows below).
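
A small sketch of the percentile-based selection step; the number of rounds and the `confidences` array are illustrative:

```python
import numpy as np

def select_pseudo_labels(confidences, round_idx, total_rounds=5):
    """Keep the top fraction of pseudo-labels by confidence; the fraction grows each round
    (20% -> 40% -> ... -> 100% when total_rounds = 5)."""
    keep_fraction = (round_idx + 1) / total_rounds
    threshold = np.percentile(confidences, 100 * (1 - keep_fraction))
    return confidences >= threshold          # boolean mask over the pseudo-labeled samples
```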


4. Holistic Model

These methods integrate consistency-regularization models with pseudo-labeling models.

1. FixMatch

Link to the paper: FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence

FixMatch is a Holistic semi-supervised learning method proposed by Google Brain.

  1. First, train a model using the labeled data, computing the loss between the predictions and the true labels to obtain the supervised loss.
  2. For the unlabeled data, noise is added with both a weak augmentation (translation, flipping, etc.) and a strong augmentation (Cutout, pixel distortion, etc.), and the two versions are fed separately into the model trained in the first step.
  3. From the prediction on the weakly augmented view, the class with the highest confidence is taken as the true class and converted into a one-hot pseudo-label; that is, the weak-view result is treated as the real label. The loss between this pseudo-label and the prediction on the strongly augmented view gives the unsupervised loss. Unlike earlier SSL algorithms, FixMatch uses a cross-entropy loss for the unsupervised part, precisely because the weak-view result is regarded as the real label.
  4. The supervised loss and the unsupervised loss are combined as a weighted sum to obtain the final loss, which is backpropagated to update the model; λu is the weight of the loss on the unlabeled data (a minimal sketch of the unsupervised part follows this list).
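
A minimal PyTorch-style sketch of the unsupervised part; `model`, the two augmentation functions, and `lambda_u` are placeholders, while the confidence threshold `tau` reflects the paper's practice of keeping only high-confidence pseudo-labels:

```python
import torch
import torch.nn.functional as F

def fixmatch_unsupervised_loss(model, x_unlabeled, weak_aug, strong_aug,
                               tau=0.95, lambda_u=1.0):
    """Pseudo-label from the weakly augmented view, cross-entropy on the strong view."""
    with torch.no_grad():
        probs_weak = F.softmax(model(weak_aug(x_unlabeled)), dim=-1)
        conf, pseudo_labels = probs_weak.max(dim=-1)   # hard (one-hot) pseudo-labels
        mask = (conf >= tau).float()                   # keep only confident predictions
    logits_strong = model(strong_aug(x_unlabeled))
    per_sample = F.cross_entropy(logits_strong, pseudo_labels, reduction="none")
    return lambda_u * (mask * per_sample).mean()
```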


2. Semi-ViT

Link to the paper: Semi-supervised Vision Transformers at Scale

Semi-ViT is a Transformer-based large-scale semi-supervised vision algorithm proposed by Amazon in 2022.

  1. First, pre-train with MAE (masked autoencoder self-supervised training) on all data (both labeled and unlabeled).
  2. Extract the ViT encoder from the trained MAE and fine-tune it on the labeled data.
  3. Feed the unlabeled data into the fine-tuned model; if the highest confidence in a prediction exceeds the set confidence threshold, the corresponding class is used as a pseudo-label.
  4. Shuffle the labeled data together with the pseudo-labeled samples produced in step 3 to obtain a shuffled set, then apply MixUp between this shuffled set and the labeled data and the unlabeled data respectively, yielding updated labeled and unlabeled data (a rough sketch of the mixing step follows this list).
  5. Use the Mean Teacher approach for semi-supervised fine-tuning, and train to obtain the final model.
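
As a rough illustration of step 4 only, here is a generic MixUp of one batch with a shuffled (pseudo-)labeled batch; the Beta parameter and the one-hot/soft label format are assumptions, and the exact mixing scheme in the paper differs in its details:

```python
import torch

def mixup(x_a, y_a, x_b, y_b, alpha=0.8):
    """Convex combination of two batches; y_a and y_b are one-hot (or soft) label tensors."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    x = lam * x_a + (1.0 - lam) * x_b
    y = lam * y_a + (1.0 - lam) * y_b
    return x, y
```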

5. Common data sets for semi-supervised learning

[Figure: common datasets used for semi-supervised learning]

Source: blog.csdn.net/qq_43456016/article/details/132638116