[Review] Semi-Supervised Semantic Segmentation

Reprint: 2023 Latest Semi-Supervised Semantic Segmentation Review | Technical Summary and Prospects! (qq.com)

Title: A Survey on Semi-Supervised Semantic Segmentation
Paper: https://arxiv.org/pdf/2302.09899.pdf

Overview

Comparison of Semantic Segmentation and Instance Segmentation Results

Image segmentation is one of the oldest and most widely studied computer vision (CV) problems. It refers to dividing an image into non-overlapping regions and assigning a label to each pixel, thereby obtaining the location of each region of interest (ROI) together with its category. In general, segmentation tasks are divided into semantic segmentation and instance segmentation. The former classifies each pixel into its semantic category, so that all objects or image regions belonging to the same category receive the same label; the latter goes a step further and tries to distinguish between different instances of the same class (as shown in the figure above). This article focuses on semantic segmentation.

Traditional image segmentation methods (such as thresholding and clustering) handle fixed scenes effectively, but they are not robust to complex and changeable scenes. With the emergence of deep learning, segmentation performance improved qualitatively and complex scenes became much easier to handle. However, deep learning methods require large amounts of data and labels, especially pixel-level labels, which demand enormous manpower and time. For this reason, methods based on semi-supervised learning have attracted strong interest from researchers and practitioners.

These semi-supervised methods extract knowledge from labeled data in a supervised manner and from unlabeled data in an unsupervised manner, thereby reducing the labeling effort required in fully supervised scenarios while achieving better results than purely unsupervised ones.

The main contributions of this paper are summarized as follows:

  • We provide a new taxonomy of semi-supervised semantic segmentation methods and their descriptions.

  • We conduct a series of experiments with state-of-the-art semi-supervised segmentation methods on the most widely used datasets in the literature.

  • We discuss the achieved results, the strengths and weaknesses of current approaches, and the challenges and lines of future work in this field.

Semi-Supervised Semantic Segmentation Methods

Taxonomy

Taxonomy tree of semi-supervised semantic segmentation methods

According to the main characteristics of existing methods in the semi-supervised semantic segmentation literature, we divide them into five categories, as shown in the figure above. The table below gives a more detailed breakdown.

Classification of Semi-Supervised Semantic Segmentation Methods

The first category comprises adversarial training methods, which use a GAN-like structure with two networks, one acting as a generator and the other as a discriminator.

The second category is consistency regularization methods. These methods include a regularization term in the loss function that minimizes the difference between different predictions for the same image, obtained by applying perturbations to the image or to the models involved.

The third category is pseudo-labeling methods. In general, these methods rely on models previously trained on labeled data to generate pseudo-labels for the unlabeled data.

The fourth category is methods based on contrastive learning. This learning paradigm groups similar elements and separates them from dissimilar elements in a specific representation space.

The last category is hybrid methods, which combine methods such as consistency regularization, pseudo-labeling, and contrastive learning.

Adversarial Learning Methods

Generative Adversarial Networks (GANs) have become a very popular framework, demonstrating good performance in numerous tasks such as image generation, object detection, and semantic segmentation. A typical GAN framework consists of two networks, a generator and a discriminator. The generator aims to learn the distribution of the target data, making it possible to generate synthetic images from random noise. The discriminator aims to distinguish between real images (belonging to the true distribution) and fake images (produced by the generator). These networks are trained in an adversarial manner: the generator tries to confuse the discriminator by generating images that increasingly resemble the target distribution, while the discriminator tries to improve its ability to tell real images from fake ones. This adversarial training process is formally defined as follows:
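The objective below is the standard GAN minimax formulation (as introduced by Goodfellow et al.), which is what the survey denotes as Equation 1:

$$\min_G \max_D \; \mathbb{E}_{x\sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z\sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big] \tag{1}$$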

Equation 1 defines a minimax game between the discriminator D and the generator G. The first term seeks to maximize the accuracy of D, while the second term pushes G to improve the quality of the images it produces.

Adversarial training methods for semi-supervised semantic segmentation fall into two subcategories. The key aspect that distinguishes them is whether or not a generative model is included during training. Below we detail the approaches in each of these two categories.

Adversarial methods involving generators

Framework of generator-based adversarial semi-supervised semantic segmentation

N. Souly et al. proposed a GAN-based semi-supervised semantic segmentation framework in 2017 [1]. The framework aims, on the one hand, to process and extract knowledge from large amounts of unlabeled data and, on the other hand, to increase the number of available training examples through synthetic image generation. Specifically, the method includes a generative network that approximates the distribution of target images, enabling the generation of new training examples. The segmentation network takes on the role of the discriminator and receives both real and synthetic images as input, as shown in the figure above. The generator loss (L_G) and the discriminator loss (L_D) are defined as follows:
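The following is a schematic reconstruction consistent with the description below (in [1] the segmentation network f_θ itself acts as the discriminator, with an extra class reserved for fake pixels; see the paper for the exact formulation):

$$\mathcal{L}_D = -\,\mathbb{E}_{x\sim p_{data}}\big[\log\big(1 - f_\theta(x)_{fake}\big)\big] - \mathbb{E}_{z\sim p_z}\big[\log f_\theta(G(z))_{fake}\big] + \gamma\,\mathcal{L}_{ce}\big(f_\theta(x_l), y_l\big) \tag{2}$$

$$\mathcal{L}_G = -\,\mathbb{E}_{z\sim p_z}\big[\log\big(1 - f_\theta(G(z))_{fake}\big)\big] \tag{3}$$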

The discriminator loss L_D (Equation 2) consists of three terms. The first penalizes the model for labeling real samples as fake; the second penalizes it for labeling fake samples as real. The last is the supervised term, which enforces the correct classification of each pixel of the labeled set into its corresponding class; γ weights this supervised term during training. The generator loss L_G (Equation 3) tries to improve the quality of the generated images by penalizing G whenever f_θ detects that an image is synthetic.

Adversarial methods without generators

A framework for adversarial learning networks without generators

On the other hand, we group here those methods that use adversarial training with a GAN-like structure but do not include a generative model. In all methods under this subcategory, the generator of the classical GAN is replaced by a segmentation network, whose output is fed to a discriminator that distinguishes real segmentation maps from those produced by the segmentation network.

This GAN-like architecture for semantic segmentation was originally proposed in [2]. The authors propose a fully convolutional discriminator that receives two segmentation maps (one from the ground-truth labels and the other from the segmentation model's prediction). By training the discriminator network adversarially together with the segmentation model, the discriminator learns to distinguish the real label map from the predicted map and produces a per-pixel confidence map. This confidence map indicates the segmentation quality of each region, so that high-confidence predictions can be used in place of ground-truth labels during training. The network structure is shown in the figure above. The loss functions involved in these methods take the following form:
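Schematically (the exact terms and weights differ between methods), the segmentation network is trained with a combination of a supervised term, an adversarial term, and a semi-supervised term:

$$\mathcal{L}_{seg} = \mathcal{L}_{ce} + \lambda_{adv}\,\mathcal{L}_{adv} + \lambda_{semi}\,\mathcal{L}_{semi}, \qquad \mathcal{L}_{adv} = -\,\mathbb{E}\big[\log D(f_\theta(x))\big]$$

where L_ce is the supervised cross-entropy on labeled data, L_adv pushes predicted maps to look real to the discriminator D, and L_semi trains on the high-confidence regions of the discriminator's confidence map, treated as pseudo-labels on unlabeled images.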

Based on this, S4GAN [3] uses a simpler discriminator that no longer makes per-pixel predictions but judges segmented regions as a whole. Additionally, it adds a separate branch that performs image-level classification. The adversarial network approach in [4] also incorporates an image-level discriminator and improves the generator loss by adding a variance regularization term. Other work [5] proposes to use two discriminators, one at the image level and the other at the pixel level, used together to define confidence regions in the image more accurately.

Both error-correcting supervision (ECS) [6] and guided collaborative training (GCT) [7] are based on a cooperative strategy very similar to the original adversarial one. These methods introduce an additional network that takes on the role of the discriminator, called a correction network in ECS and a flaw detector in GCT. Besides pixel-level confidence maps, these methods also provide corrections for the regions with low confidence.

Other adversarial approaches combine attention modules with the objective of modeling long-range semantic dependencies. This is the case for the method in [8], which also incorporates spectral normalization to reduce instability during training. Another method [9] combines an attention module with a sparse representation module, which enhances the model's perception of target positions and edge information.

Consistency Regularization

Mean Teacher Network Framework

Consistency regularization methods are based on the smoothness assumption [10]: for two points that are close in the input space, the labels should be the same. In this sense, semi-supervised methods based on consistency regularization exploit unlabeled data by applying perturbations to it and training models that are robust to these perturbations. This is achieved by adding a regularization term to the loss function that measures the distance between the original prediction and the perturbed prediction. In generic form (d is a distance such as mean squared error or KL divergence, and $\tilde{x}_u$ is a perturbed version of the unlabeled image $x_u$):

$$\mathcal{L}_{cons} = d\big(f_\theta(x_u),\, f_\theta(\tilde{x}_u)\big)$$

Many of these methods build on Mean Teacher [11], whose core idea is to enforce consistency between the predictions of a student network and a teacher network. The weights of the teacher network are computed as an exponential moving average (EMA) of the student network's weights; the structure is shown in the figure above.
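As a minimal PyTorch-style sketch of the EMA update at the core of Mean Teacher (the function name and decay value `alpha` are illustrative, not from the survey):

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, alpha=0.99):
    """Mean Teacher EMA step: teacher weights become a running average
    of the student's weights. `alpha` is the EMA decay (close to 1)."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)
```

The teacher's predictions on perturbed unlabeled images then serve as targets for the student's consistency loss.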

The main difference among consistency regularization methods for semi-supervised semantic segmentation lies in how the perturbations are applied. On this basis, we divide these methods into four categories. The first is input-perturbation-based methods, which apply perturbations directly to the input image using data augmentation techniques and force the model to predict the same label for the original and augmented images. The second is feature-perturbation-based methods, which inject perturbations inside the segmentation network to obtain modified features. The third is network-perturbation-based methods, which obtain perturbed predictions by using different networks, for example networks with different initial weights. The last category combines the previous three types of perturbation.

Data Perturbation

Framework of consistency regularization methods based on data perturbation

First, we group the consistency regularization methods that use data augmentation techniques to apply perturbations directly to unlabeled input images. These methods train a segmentation model that is insensitive to such input perturbations, predicting segmentation maps that are as similar as possible for the original image and its augmented version. The key aspect differentiating these methods is the way they modify the data, and the literature contains several ways of applying data augmentation to semi-supervised semantic segmentation. The consistency term included in these data-augmentation-based methods is defined as follows:
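Schematically, for an augmentation $\mathcal{A}$ (which, for mask-based mixing techniques, is applied to the prediction as well as to the image):

$$\mathcal{L}_{cons} = d\big(f_\theta(\mathcal{A}(x_u)),\, \mathcal{A}(f_\theta(x_u))\big)$$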

The method in [12] applies the CutOut and CutMix techniques to semi-supervised semantic segmentation. The key idea is as follows. CutOut masks out a rectangular region of the image during training, and a regularization term enforces consistency between the predictions for the original and the modified image. CutMix, on the other hand, uses a rectangular mask to combine two images into a new one, in which the masked part comes from one of the original images and the rest from the other. Another approach [13] extends this by adding a new term to the loss function, called the structured consistency loss, which incorporates the concept of structured knowledge distillation [14].

ClassMix [15] is designed specifically for semantic segmentation. It differs from CutMix in the shape of the mask applied to blend the images: here the masked region coincides with the image regions belonging to a subset of one image's classes, so the pixels of exactly those classes are copied onto the other image, producing the new augmented image. The difference between the original and augmented predictions is computed with a regularization term, in the same way as the previous technique. Going further, ComplexMix [16] proposes to combine the CutMix and ClassMix augmentation techniques.
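A rough sketch of the ClassMix mixing step, assuming pseudo-labels (argmax predictions) for the two unlabeled images are already available; the names and the half-of-classes choice follow the general description above rather than any particular implementation:

```python
import torch

def classmix(img_a, img_b, pseudo_a, pseudo_b):
    """Copy the pixels of half of image A's classes onto image B.

    img_a, img_b : (3, H, W) float tensors.
    pseudo_a/b   : (H, W) long tensors with predicted class indices.
    Returns the mixed image and its mixed pseudo-label.
    """
    classes = torch.unique(pseudo_a)
    # Randomly keep half of the classes present in image A.
    chosen = classes[torch.randperm(len(classes))[: max(1, len(classes) // 2)]]
    mask = torch.isin(pseudo_a, chosen)                    # (H, W) bool
    mixed_img = torch.where(mask.unsqueeze(0), img_a, img_b)
    mixed_lbl = torch.where(mask, pseudo_a, pseudo_b)
    return mixed_img, mixed_lbl
```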

Besides these segmentation-specific augmentations, other methods [17] use classical data augmentation techniques such as cropping, color jittering, or flipping to obtain a perturbed version of the original image.

Feature Perturbation

Framework of Consistency Regularization Method Based on Feature Perturbation

The second way to introduce perturbations during training is to perturb the internal features of the segmentation network. Cross-consistency training (CCT) [18] follows this idea: its architecture extends a supervised encoder-decoder segmentation model (such as DeepLabV3+) with several auxiliary decoders. First, supervised training is performed on the available labeled data using the main decoder. Then, to exploit the unlabeled data, the output of the encoder is perturbed in different ways, producing different versions of the same features, which are routed to the different auxiliary decoders. Finally, consistency between the decoder outputs is enforced, encouraging similar predictions for the different perturbed versions of the encoder features.

The consistency terms included in these feature perturbation-based methods are defined as follows:
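In schematic form, with $z_u$ the encoder features of an unlabeled image and $\tilde{z}_u^{\,k}$ their k-th perturbed version:

$$\mathcal{L}_{cons} = \frac{1}{K}\sum_{k=1}^{K} d\big(h(z_u),\, h_k(\tilde{z}_u^{\,k})\big)$$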

where h is the main decoder, h_k is the kth auxiliary decoder, and K is the number of auxiliary decoders.
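A condensed sketch of the idea follows (one perturbation type only; CCT itself uses several kinds of feature perturbations, so treat the noise range and loss choice here as assumptions):

```python
import torch
import torch.nn.functional as F

def cct_consistency(encoder, main_dec, aux_decs, x_u):
    """Consistency between the main decoder's prediction and auxiliary
    decoders fed perturbed versions of the shared encoder features."""
    z = encoder(x_u)                               # shared features
    with torch.no_grad():
        target = main_dec(z).softmax(dim=1)        # fixed target, no gradient
    loss = 0.0
    for dec in aux_decs:
        noise = torch.empty_like(z).uniform_(-0.3, 0.3)   # feature perturbation
        loss = loss + F.mse_loss(dec(z + noise).softmax(dim=1), target)
    return loss / len(aux_decs)
```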

Network Perturbation

Framework of consistency regularization methods based on network perturbation

Another way to introduce perturbations during training is to use different segmentation networks, where the differences between the networks constitute the perturbation in the resulting predictions. Cross pseudo-supervision (CPS) [19] follows a training procedure similar to Mean Teacher, but the two networks are trained in parallel and independently instead of updating one from the EMA of the other. Although the two networks share the same architecture, they are initialized with different random weights, increasing the variance between them. An extension of this approach in which the training procedure involves three networks can be found in [20]. Another method [21] emphasizes the importance of enforcing diversity across networks and proposes to use adversarial examples and resampling strategies to train the models on different ensembles.

As with other consistency regularization methods, consistency between the networks' predictions on unlabeled images is enforced by a regularization term in the loss function. For the case of two networks, it is defined as follows:
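Schematically:

$$\mathcal{L}_{cons} = d\big(f_\theta(x_u),\, g_\varphi(x_u)\big)$$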

where f_θ and g_φ are different networks trained independently.
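In code, the cross pseudo-supervision term can be sketched as follows (a PyTorch-style sketch of the hard-pseudo-label cross-supervision structure described in [19]; details such as loss weighting are omitted):

```python
import torch.nn.functional as F

def cps_loss(logits_f, logits_g):
    """Each network is supervised by the other's argmax pseudo-label.

    logits_f, logits_g : (B, C, H, W) raw outputs of the two networks.
    """
    pseudo_f = logits_f.argmax(dim=1).detach()     # pseudo-label from f
    pseudo_g = logits_g.argmax(dim=1).detach()     # pseudo-label from g
    return F.cross_entropy(logits_f, pseudo_g) + F.cross_entropy(logits_g, pseudo_f)
```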

Combined Perturbation

Finally, we introduce methods that combine several of the perturbation types described above.

The method in [22] combines input, feature, and network perturbations. It highlights that a greater variety and intensity of perturbations can be harmful when the predictions are inaccurate. To ensure accurate predictions for unlabeled images, the method therefore extends Mean Teacher with a confidence-weighted cross-entropy loss, replacing the mean squared error (MSE) used by the classic Mean Teacher. It also proposes a new form of feature perturbation based on virtual adversarial training [23].

The method in [24] combines input perturbations (specifically the CutMix technique) with feature perturbations. Instead of adding auxiliary decoders as in CCT, it applies the perturbations directly to the features.

Pseudo-Labeling Methods

Pseudo-labeling methods are among the most widely known and the earliest semi-supervised methods [25]. The idea behind them is simple: generate pseudo-labels for unlabeled images from the predictions of a model previously trained on labeled data. The labeled dataset is then extended with these new image and pseudo-label pairs, and a new model is trained on the extended dataset. The loss function of pseudo-labeling methods is as follows:
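In schematic form, with $\mathcal{L}_{ce}$ the cross-entropy loss, the first term taken over labeled pairs $(x, y)$ and the second over unlabeled images $x$ with pseudo-labels $\hat{y}$:

$$\mathcal{L} = \mathcal{L}_{ce}\big(f_\theta(x),\, y\big) + \lambda\,\mathcal{L}_{ce}\big(f_\theta(x),\, \hat{y}\big)$$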

where ŷ is the pseudo-label for image x, generated from the predicted probabilities of the segmentation model f_θ (usually by one-hot encoding), and λ weights the unsupervised part of the loss function.

Based on the differences between the models involved in training and the way the pseudo-labels are generated, this paper distinguishes two types of pseudo-labeling methods. The first is self-training, which relies on a single supervised base model and represents the simplest form of pseudo-labeling: the pseudo-labels are the model's own high-confidence predictions. The second is mutual training, which involves multiple models with significant differences, such as different initialization weights or training on different views of the dataset; each model is retrained using unlabeled images with pseudo-labels generated by the other models involved in the process.

Self-Training

A Framework for Self-Training-Based Pseudo-Labeling Methods

Self-training is the simplest pseudo-labeling semi-supervised method. It was first proposed in [26], is described in detail in the review [27], and was first applied to deep neural networks in [28]. These methods retrain the base supervised model on a training set extended with its own predictions. A typical self-training process includes the following steps (sketched in code after the list):

1. The supervised model is trained on the available labeled data.

2. Predictions are obtained on the unlabeled data using the previously trained model. Predictions with confidence above a predefined threshold become pseudo-labels for the unlabeled data and are added to the labeled dataset.

3. The supervised model is retrained using a new dataset consisting of labeled and pseudo-labeled data.

This process can be repeated iteratively, using the model from step 3 to obtain new pseudo-labels and improving their quality at each iteration, until no new predictions exceed the confidence threshold.
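The loop can be sketched as follows; `train_supervised` and `predict_probs` are hypothetical placeholders for a concrete method's training and inference routines, and 255 is assumed to be the ignore index:

```python
def self_training(model, labeled, unlabeled, train_supervised, predict_probs,
                  tau=0.9, rounds=3):
    """Generic self-training: train, pseudo-label confident pixels, retrain.

    labeled   : list of (image, mask) pairs.
    unlabeled : list of images (torch tensors).
    """
    train_supervised(model, labeled)                  # step 1
    for _ in range(rounds):
        pseudo_pairs = []
        for img in unlabeled:
            probs = predict_probs(model, img)         # (C, H, W) probabilities
            conf, label = probs.max(dim=0)            # per-pixel confidence
            label[conf < tau] = 255                   # step 2: drop low confidence
            pseudo_pairs.append((img, label))
        train_supervised(model, labeled + pseudo_pairs)   # step 3
    return model
```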

In the following, we present self-training-based semi-supervised semantic segmentation methods, each of which contributes some variation on the original algorithm to improve learning. For example, the method proposed in [29] extends the original self-training process with a centroid sampling technique, aiming to address class imbalance in the pseudo-labels.

Other methods add auxiliary networks during self-training. In [30], the authors extend the self-training process with a residual network, trained on labeled images and subsequently used to refine the pseudo-labels produced by the segmentation model. The pseudo-labels predicted by a model can differ greatly from the real label space, which is a problem when training a model with two label inputs, since it can produce conflicting gradient directions and thus a confused backpropagation process. The approach in [31] therefore uses a segmentation model with a shared encoder (e.g., ResNet-101) and two different decoders, one for each label space.

Integrating data augmentation into self-training has also been proposed. ST++ [32] applies data augmentation techniques to unlabeled images during self-training, combined with a selection phase: at each iteration, images with reliable pseudo-labels are prioritized, while those with a higher probability of pseudo-label errors are discarded.

However, data augmentation may change the distribution of the mean and variance used in batch normalization. To address this, the method in [33] proposes distribution-specific batch normalization. It also integrates a self-correcting loss that dynamically reweights pixels according to confidence, to avoid overfitting to noisy labels and under-learning the most difficult classes.

A common problem for such methods is the distribution mismatch between real and pseudo-labels, with the latter usually biased towards the majority classes. To obtain unbiased pseudo-labels, the method in [34] proposes a strategy of distribution alignment and random sampling, combined with data augmentation techniques.

Another proposal focuses on the difficulty of defining an optimal ratio between real and pseudo-labeled data during self-training. Two strategies are proposed to approach this optimal value during iterative retraining, one based on random search (RIST) and the other on a greedy algorithm (GIST) [35].

Mutual Training

A Framework for Pseudo-Labeling Methods Based on Mutual Training

One of the main drawbacks of the self-training methods described above is the lack of mechanisms to detect their own errors. Instead of learning from their own predictions, mutual learning [36] methods extend self-training by involving multiple models, each trained using pseudo-labels generated by the others. The diversity among the participating models is key to the good performance of such methods [37]. This is why existing approaches try to explicitly induce differences between the underlying supervised models, for example by initializing them with different pretrained weights or by training each model on different views or subsets of the training set. In other studies, similar approaches are categorized as divergence-based strategies [38], multi-view training [39], or co-training [40], since they mainly rely on exploiting the prediction differences between the involved models.

Dynamic Mutual Training (DMT) is a mutual learning method for semi-supervised semantic segmentation that exploits the divergence between models to detect errors in the generated pseudo-labels. It accounts for these differences through a loss function that is dynamically reweighted during training according to the disagreement between two models trained independently, each using pseudo-labels generated by the other. A larger disagreement at a particular pixel indicates a greater probability of error, so that pixel is weighted lower in the loss function and has less impact on training than pixels or regions where the models agree.
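A sketch of the dynamic reweighting idea, with model A trained on model B's pseudo-labels; the weighting rule and the exponent `gamma` are simplified assumptions, as DMT's actual schedule is more elaborate:

```python
import torch
import torch.nn.functional as F

def dmt_loss(logits_a, logits_b, gamma=5.0):
    """Cross-entropy of model A against B's pseudo-labels, downweighted
    where the two models disagree (likely pseudo-label errors)."""
    with torch.no_grad():
        probs_b = logits_b.softmax(dim=1)
        conf_b, pseudo_b = probs_b.max(dim=1)         # (B, H, W)
        agree = logits_a.argmax(dim=1) == pseudo_b
        # Disagreement-prone pixels receive a small weight.
        weight = torch.where(agree, torch.ones_like(conf_b), conf_b ** gamma)
    ce = F.cross_entropy(logits_a, pseudo_b, reduction="none")   # (B, H, W)
    return (weight * ce).mean()
```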

Another approach extends DMT [41] with a pseudo-label augmentation strategy. To retain acquired knowledge throughout training and keep the model from biasing towards the most recently learned classes, the authors propose considering the pseudo-labels generated in previous stages when building the current ones.

Contrastive Learning Methods

Contrastive learning focuses on high-level features, enabling a network to distinguish categories well without ground-truth labels. In other words, these methods pull similar samples together in a representation space and push them away from dissimilar ones. In many contrastive learning methods, the target samples to be compared are called queries, while similar and dissimilar samples are called positive and negative keys, respectively. Since the data lacks annotations, the samples considered similar during training are augmented versions of the same sample, while the rest of the data is treated as dissimilar. The most relevant contrastive methods differ in how these pairs of augmented images are obtained: some apply data augmentation techniques (such as cropping, color jittering, or flipping), as in SimCLR [42]; others divide the image into different overlapping patches and treat these patches as independent images, as in CPC [43].

Given the success of such methods, which even outperform their supervised counterparts on some problems, a series of contrastive learning methods designed specifically for semantic segmentation have been proposed in recent years. ReCo [44] is one of the first contrastive-learning-based methods in this field. It attaches an auxiliary decoder on top of the segmentation encoder, mapping input features to a higher-dimensional representation space where queries and keys are sampled. The proposed contrastive loss forces queries to be close to positive keys in the representation space and far from negative keys. Since using all pixels of a high-resolution image to compute a contrastive loss is impractical, ReCo incorporates an active sampling strategy that samples less than 5% of the pixels in the image: pixels of classes that are usually confused with the query class have a higher probability of being selected as negative keys, and prediction confidence is used to select as queries those pixels that are hardest for the segmentation model to classify.
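A compact sketch of the pixel-level InfoNCE loss this family builds on; the sampling of queries and keys (ReCo's active sampling) is omitted, and the tensors below are assumed to be already-sampled, L2-normalized pixel embeddings:

```python
import torch
import torch.nn.functional as F

def pixel_info_nce(queries, pos_keys, neg_keys, temperature=0.1):
    """InfoNCE over sampled pixel embeddings.

    queries  : (N, D) anchor pixel features.
    pos_keys : (N, D) one positive key per query (same class).
    neg_keys : (N, M, D) M negative keys per query (other classes).
    """
    pos = (queries * pos_keys).sum(dim=1, keepdim=True)        # (N, 1)
    neg = torch.einsum("nd,nmd->nm", queries, neg_keys)        # (N, M)
    logits = torch.cat([pos, neg], dim=1) / temperature
    # The positive key sits at column 0 of every row.
    targets = torch.zeros(len(queries), dtype=torch.long)
    return F.cross_entropy(logits, targets)
```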

Another contrastive learning approach proposed for semi-supervised semantic segmentation is based on positive-only contrastive learning [45], which samples only positive keys. A key element of the approach is the creation and dynamic updating of a memory bank containing a subset of samples from the labeled set; samples with higher prediction confidence are selected for storage. A contrastive loss then keeps each sample's features close to those of similar samples stored in the memory bank.

Hybrid Methods

The last part of this section introduces methods that integrate the previous approaches, attempting to exploit the advantages of both pseudo-labeling and consistency regularization. For example, the method in [46] proposes a three-stage self-training framework with consistency regularization in the middle stage. Specifically, a multi-task model is used during self-training that learns the segmentation problem with consistency regularization (task 1) and introduces statistics from the pseudo-labels into the optimization process (task 2).

Likewise, Adaptive Equalization Learning (AEL) [47] combines features of consistency regularization and pseudo-labeling. AEL is based on FixMatch [48], a widely used hybrid method originally proposed for image classification. In segmentation problems, models commonly underperform on certain classes, mainly because of their difficulty or their imbalance relative to the rest. To address this, the method maintains a confidence bank that dynamically records per-class performance during training, and uses data augmentation techniques and adaptive equalization sampling to support the training of underrepresented classes.
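The FixMatch recipe that AEL builds on can be sketched as follows (a simplified sketch: `weak_aug` and `strong_aug` stand for the two augmentation policies, and AEL's confidence bank and adaptive sampling are not shown):

```python
import torch
import torch.nn.functional as F

def fixmatch_unsup_loss(model, x_u, weak_aug, strong_aug, tau=0.95):
    """Pseudo-label the weak view, train on the strong view."""
    with torch.no_grad():
        probs = model(weak_aug(x_u)).softmax(dim=1)
        conf, pseudo = probs.max(dim=1)               # (B, H, W)
    logits_strong = model(strong_aug(x_u))
    ce = F.cross_entropy(logits_strong, pseudo, reduction="none")
    mask = (conf >= tau).float()                      # keep confident pixels only
    return (mask * ce).sum() / mask.sum().clamp(min=1.0)
```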

PseudoSeg [49] also integrates features of consistency regularization and pseudo-labeling. The authors highlight that the usual way of obtaining pseudo-labels (taking the output of a trained segmentation model and applying a confidence threshold) can fail and yield low-quality pseudo-labels. To address this, they propose a method focused on the structure and quality of the pseudo-labels, generating them from two different sources: on the one hand, the output of a segmentation model, and on the other, the output of a class-activation-map-style algorithm [50]. Unlike segmentation, which seeks dense and accurate predictions, class activation algorithms only need to produce coarser-grained outputs.

A key bottleneck of semi-supervised segmentation methods can be the separate processing of labeled and unlabeled data during training. The hybrid GuidedMix-Net [51] addresses this problem by interpolating between labeled and unlabeled image pairs to capture the interaction between the two.

Recently, there has also been considerable interest in methods that combine consistency regularization with contrastive learning. Directional Context-aware consistency (DCA) [52] points to the risk of context overfitting in semi-supervised settings, where the contexts in which a given object appears are limited to a reduced set of labeled images. This may cause the segmentation model to pay too much attention to these specific contexts rather than to the important features of the objects to be segmented. To address this, DCA incorporates a data augmentation technique that takes two crops of the same image with an overlapping region: this places the shared region in two different contexts, and consistency between the two crops is enforced via a contrastive loss.

The method in [53] pursues two properties at once: consistency in the prediction space and contrast in the feature space. On the one hand, an l2 loss enforces consistency between the predictions for two augmented versions of an unlabeled image; on the other, a contrastive loss pulls positive (similar) pairs closer and pushes negative (dissimilar) pairs apart in the feature space. In addition, C3-SemiSeg [54] not only uses consistency regularization and contrastive learning, but also integrates cross-set contrastive learning to improve the feature representation.

The method in [55] combines a consistency regularization framework based on cross teacher training (CTT) with two complementary contrastive learning modules. The CTT framework reduces the accumulation of errors between teacher and student networks, while the contrastive modules promote class separation in the feature space. The method in [56] proposes a data augmentation technique that attempts to preserve image context, together with a new adversarial dual-student framework that improves on the classic Mean Teacher.

Experiments

Ratio of labeled to unlabeled data

The PASCAL VOC 2012 dataset is evaluated at four labeled-data ratios: 1/100, 1/50, 1/20, and 1/8; the Cityscapes dataset at only one: 1/8.

Segmentation Performance Comparison on PASCAL VOC Segmentation Task

In the configurations with labeled/unlabeled ratios of 1/100, 1/50, and 1/20, DMT achieves the highest accuracy, outperforming the second-best methods by 1-3% on average. In the 1/8 configuration it obtains the second-highest accuracy, only 0.5% below the best.

Visual comparison of segmentation results on the Cityscapes dataset (1)

Visual comparison of segmentation results on the Cityscapes dataset (2)

It can be clearly observed that the segmentation results of DMT are closer to the ground-truth labels. Compared with the other methods, the target regions segmented by DMT are more complete, and the boundaries between regions are captured more accurately.

Challenges and Prospects

Evaluation criteria: The studies in the semi-supervised semantic segmentation literature do not share a common experimental framework (they use different datasets, different data partitions, different implementations, etc.). Proposing a standard, realistic experimental and evaluation framework that all researchers can adopt would be a key step in the development of this research field.

Method families with potential for improvement: We highlight two categories with particular potential for future research. First, the pseudo-labeling approach, especially the mutual training subcategory, which achieves the best results in our experimental analysis; only two semi-supervised segmentation methods exist in this subcategory, so we believe it has much room for improvement and development. Second, hybrid methods, a very promising category due to their novelty and the many possible combinations.

Diversity of base models: Many methods employ multiple base models, and the diversity of these models can be a key factor in obtaining a good final model. However, these methods usually just instantiate the same state-of-the-art supervised segmentation model, obtaining a poorly diverse set of models, and do not investigate this decision in more depth. Future research could study the impact of inter-model diversity on the final results of semi-supervised segmentation methods.

Evaluation on more realistic scenarios: Some of the datasets most widely used in fully and semi-supervised segmentation are object-centric image datasets (e.g., PASCAL VOC 2012). Such images represent very controlled scenes, quite different from real-world ones, so a model may obtain good results on these datasets and still be of little use in real applications. Emerging datasets (e.g., Cityscapes) present less controlled images and more semantic dependencies between classes, and demand new approaches that can handle less controlled images and model the semantic dependencies between classes.

New trends: Transformers [57] are a type of network architecture originally proposed for natural language processing, with an encoding philosophy quite different from that of CNNs. More recently, these models have begun to be applied to CV problems. They can learn semantic relationships between classes, even classes far apart in the image, which is desirable in real-world scenes where such relationships abound. Although transformers have recently been applied to supervised semantic segmentation with promising results, only a few approaches have attempted to introduce them into semi-supervised learning. The application of these models to semi-supervised semantic segmentation can therefore be considered one of the most promising future research directions.

Summary

This paper reviews semi-supervised semantic segmentation methods and lays out the challenges and future research trends of the field.

One of the main contributions of this paper is a new taxonomy that groups all previous work (43 recently published methods relevant to this field) into five categories: adversarial methods, consistency regularization, pseudo-labeling, contrastive learning, and hybrid methods. In this way, we give readers a fast and accurate view of the state of the art in this field, together with a detailed description of each existing method.

The analysis of the state of the art and the proposed taxonomy are complemented by an experimental study comparing methods from all categories under homogeneous experimental conditions, using two of the most common datasets in the field: PASCAL VOC 2012 and Cityscapes. This gives the reader an intuition of how each method performs. The experiments cover 10 methods, and the mutual training method DMT provides the best performance.

Finally, we reflect on the current challenges and potential future research directions in semi-supervised segmentation, emphasizing the need for a standardized experimental and evaluation framework, the value of semantically rich, realistic benchmarks with complex scenes and inter-class dependencies, and the great potential of vision transformers, recently brought to CV, in semi-supervised scenarios.

References

[1] Semi-supervised semantic segmentation using generative adversarial network.

[2] Semantic segmentation using adversarial networks.

[3] Semi-supervised semantic segmentation with high- and low-level consistency.

[4] Semi-supervised semantic segmentation based on adversarial networks.

[5] Semi-supervised semantic image segmentation using dual discriminator adversarial networks.

[6] Semi-supervised segmentation based on error-correcting supervision.

[7] Guided collaborative training for pixel-wise semi-supervised learning.

[8] Stable self-attention adversarial learning for semi-supervised semantic image segmentation.

[9] Semi-supervised semantic segmentation using an improved generative adversarial network.

[10] Semi-supervised learning.

[11] Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.

[12] Semi-supervised semantic segmentation needs strong, varied perturbations.

[13] Structured consistency loss for semi-supervised semantic segmentation.

[14] Structured knowledge distillation for semantic segmentation.

[15] ClassMix: Segmentation-based data augmentation for semi-supervised learning.

[16] ComplexMix: Semi-supervised semantic segmentation via mask-based data augmentation.

[17] Semi-supervised semantic segmentation constrained by consistency regularization.

[18] Semi-supervised semantic segmentation with cross-consistency training.

[19] Semi-supervised semantic segmentation with cross pseudo supervision.

[20] Deep tri-training for semi-supervised image segmentation.

[21] Deep co-training for semi-supervised image segmentation.

[22] Perturbed and strict mean teachers for semi-supervised semantic segmentation.

[23] Virtual adversarial training: A regularization method for supervised and semi-supervised learning.

[24] Perturbation consistency and mutual information regularization for semi-supervised semantic segmentation.

[25] Semi-supervised learning literature survey.

[26] Unsupervised word sense disambiguation rivaling supervised methods.

[27] Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study.

[28] Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks.

[29] Improving semantic segmentation via efficient self-training.

[30] A residual correction approach for semi-supervised semantic segmentation.

[31] Digging into pseudo label: a low-budget approach for semi-supervised semantic segmentation.

[32] ST++: Make self-training work better for semi-supervised semantic segmentation.

[33] A simple baseline for semi-supervised semantic segmentation with strong data augmentation.

[34] Re-distributing biased pseudo labels for semi-supervised semantic segmentation: A baseline investigation.

[35] The GIST and RIST of iterative self-training for semi-supervised segmentation.

[36] Deep mutual learning.

[37] A new analysis of co-training.

[38] A survey on deep semi-supervised learning.

[39] An overview of deep semi-supervised learning.

[40] A survey on semi-supervised learning.

[41] Pseudoseg: Designing pseudo labels for semantic segmentation.

[42] A simple framework for contrastive learning of visual representations.

[43] Representation learning with contrastive predictive coding.

[44] Bootstrapping semantic segmentation with regional contrast.

[45] Exploring simple siamese representation learning.

[46] A three-stage self-training framework for semi-supervised semantic segmentation.

[47] Semi-supervised semantic segmentation via adaptive equalization learning.

[48] FixMatch: Simplifying semi-supervised learning with consistency and confidence.

[49] PseudoSeg: Designing pseudo labels for semantic segmentation.

[50] Grad-CAM: Visual explanations from deep networks via gradient-based localization.

[51] GuidedMix-Net: Learning to improve pseudo masks using labeled images as reference.

[52] Semi-supervised semantic segmentation with directional context-aware consistency.

[53] Pixel contrastive-consistent semi-supervised semantic segmentation.

[54] C3-SemiSeg: Contrastive semi-supervised segmentation via cross-set learning and dynamic class-balancing.

[55] Semi-supervised semantic segmentation with cross teacher training.

[56] Adversarial dual-student with differentiable spatial warping for semi-supervised semantic segmentation.

[57] An image is worth 16x16 words: Transformers for image recognition at scale.


Source: https://blog.csdn.net/m0_61899108/article/details/129961042