The latest roundup! Sorting out the progress of semi-supervised object detection along five major directions

Introduction

Object detection in the fully supervised setting is relatively mature, but labeling costs are high and impose practical limitations. Semi-supervised object detection (SSOD), which aims to learn from a small amount of labeled data together with a large amount of unlabeled data, has therefore received extensive attention. This article summarizes the field from five aspects. First, it briefly introduces several data augmentation methods. Then, mainstream semi-supervised strategies are divided into pseudo-label, consistency-regularization, graph-based, and transfer-learning-based methods, and some methods for more difficult settings are introduced. Furthermore, the relevant loss functions are introduced, common benchmark datasets are outlined, and the accuracy of different representative methods is compared.

Background knowledge

Semi-supervised learning can be regarded as a combination of supervised and unsupervised learning. Its main research question is how to make reasonable use of both labeled and unlabeled samples during training. Several assumptions are commonly applied to establish the relationship between samples and learning objectives:
(1) Smoothness assumption: if two samples located in a high-density region are close to each other, they are likely to have the same class label.
(2) Cluster assumption: if two samples are in the same cluster, they likely belong to the same category.
(3) Manifold assumption: if two samples lie in a small local neighborhood of a low-dimensional manifold, they have similar class labels.
Based on these assumptions, semi-supervised learning methods for image classification can be divided into the following categories: generative methods, graph-based methods, consistency-regularization methods, pseudo-label methods, and hybrid methods. This paper builds on that taxonomy to summarize semi-supervised object detection.
[Figure: Taxonomy of SSOD methods from a pipeline perspective]

Data augmentation

Data augmentation, the first step in SSOD, is crucial for improving model generalization and robustness. To make reasonable use of unlabeled data, the augmented data are constrained by consistency regularization so that the output labels remain consistent. Because methods differ, the augmentations they use also vary considerably.
(1) Strong augmentation
Strong augmentation can enrich the dataset and readily improve model performance. Some methods augment data with color jittering, grayscale, Gaussian blur, and cropping. The regularization effect of Cutout, however, is relatively weak.
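As a rough illustration, here is a minimal sketch of a strong-augmentation pipeline, assuming torchvision; the operations and magnitudes are illustrative rather than taken from any specific paper. Color-level operations are popular in SSOD because they leave bounding boxes unchanged, while the erasing step plays a cutout-like role.

```python
import torchvision.transforms as T

# Illustrative "strong" pipeline: color jittering, grayscale, Gaussian blur,
# plus a cutout-style random erasing applied after tensor conversion.
strong_augment = T.Compose([
    T.RandomApply([T.ColorJitter(brightness=0.4, contrast=0.4,
                                 saturation=0.4, hue=0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(kernel_size=23)], p=0.5),
    T.ToTensor(),
    T.RandomErasing(p=0.5, scale=(0.02, 0.2)),
])
```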
(2) Weak augmentation
Weak augmentation usually uses simple geometric transformations. Random horizontal flipping, random resizing, and multi-scaling are common choices. MixMatch expands the training set by randomly mixing images from different categories, but Mixup suffers from class ambiguity caused by blending background and objects.
(3) Hybrid augmentation
To avoid the above problems, the MUM method applies both weak and strong augmentation to each mini-batch of unlabeled images. Furthermore, Instant-Teaching directly applies Mosaic within a pseudo-label-based SSOD framework. STAC explores different variants of transform operations and identifies a set of effective combinations: 1) global color transforms; 2) global geometric transforms; 3) box-level transforms. A sketch of generating weak and strong views appears below.
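The weak-strong pairing used by weak-strong training schemes can be sketched as follows. This is a simplified illustration: the box bookkeeping that geometric transforms require in a real detector is omitted.

```python
import torchvision.transforms as T

# "Weak" pipeline of simple geometric transforms; in a real detector these
# must be mirrored on the ground-truth/pseudo boxes, which is omitted here.
weak_augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.Resize(800),                 # random resizing / multi-scale in practice
    T.ToTensor(),
])

def two_views(image, weak, strong):
    """Return a (weak_view, strong_view) pair of one unlabeled image,
    e.g. two_views(img, weak_augment, strong_augment)."""
    return weak(image), strong(image)
```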

Semi-supervised strategies

After data augmentation, the next step is to design a training framework to integrate information from labeled and unlabeled images. Currently, SSOD methods follow four strategies:
1. Pseudo-labels
Pseudo-label methods estimate pseudo-labels for unlabeled images using a pre-trained model, then jointly train the model on labeled data and augmented unlabeled data. Most are built on two-stage anchor-based detectors such as Faster R-CNN.
[Figure: Pseudo-label-based SSOD framework]

(1) Self-training
Self-training uses labeled data to train a teacher model, uses the teacher to predict on unlabeled data, and finally trains a student model on all the data. Many SSOD methods exploit unlabeled samples through self-training pseudo-label prediction, improving performance by training on high-confidence pseudo-labeled samples. A typical example is STAC (its pipeline is shown in the figure below), an algorithm based on hard pseudo-labels. Labeled data are used to train a teacher model that predicts on unlabeled data, and a confidence threshold selects high-quality pseudo-labels. The model is then trained with an unsupervised loss on strongly augmented unlabeled data (using the pseudo-labels) and a supervised loss on the labeled data. Note that STAC generates pseudo-labels only once; relying on these initial predictions limits further improvements in model accuracy. A minimal sketch of the thresholded pseudo-label generation follows the figure.
[Figure: STAC]
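A minimal sketch of STAC-style one-time pseudo-label generation, assuming a torchvision-style detector that returns "boxes", "scores", and "labels" per image; the threshold value is illustrative.

```python
import torch

@torch.no_grad()
def generate_pseudo_labels(teacher, unlabeled_images, score_thresh=0.9):
    """Hard pseudo-labels in the spirit of STAC: keep only high-confidence
    teacher detections. Generated once, before student training."""
    teacher.eval()
    pseudo_targets = []
    for output in teacher(unlabeled_images):
        keep = output["scores"] > score_thresh        # confidence filtering
        pseudo_targets.append({"boxes": output["boxes"][keep],
                               "labels": output["labels"][keep]})
    return pseudo_targets
```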

ISTM (shown below) proposes an interactive self-training model to avoid overfitting unlabeled data, which can happen when the differences between detection results for the same image at different training iterations are ignored. On the one hand, it uses non-maximum suppression (NMS) to fuse detection results from different iterations; on the other hand, it uses two region-of-interest (RoI) heads with different structures that estimate pseudo-labels for each other. A sketch of the NMS fusion step appears after the figure.
[Figure: ISTM]
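The NMS-based fusion of detections from different iterations can be sketched as below; this is a simplification, since ISTM fuses results per class, which is omitted here.

```python
import torch
from torchvision.ops import nms

def fuse_detections(boxes_a, scores_a, boxes_b, scores_b, iou_thresh=0.5):
    """Fuse two iterations' detections with NMS; boxes are (N, 4) tensors in
    (x1, y1, x2, y2) format. Per-class handling is omitted for brevity."""
    boxes = torch.cat([boxes_a, boxes_b])
    scores = torch.cat([scores_a, scores_b])
    keep = nms(boxes, scores, iou_thresh)   # drop duplicate detections
    return boxes[keep], scores[keep]
```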

(2) Optimized pseudo-labels
To alleviate confirmation bias and improve pseudo-label quality, most methods correct pseudo-labels during training. Unbiased Teacher adopts a two-stage framework (shown below): it alleviates overfitting by using pseudo-labels to train both the region proposal network (RPN) and the RoI head, and addresses pseudo-label bias by using an exponential moving average (EMA) and focal loss to improve pseudo-label quality. Classification confidence, which is commonly used to filter predicted boxes, cannot reflect localization accuracy; some methods therefore multiply the average classification confidence by the original detection confidence as an index that reflects both classification and localization accuracy. To further refine pseudo-label quality, CrossRectify exploits the differences between detectors to identify self-errors and applies a cross-correction mechanism. The EMA update is sketched after the figure.
[Figure: Unbiased Teacher]
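The EMA update of the teacher from the student is a small but central piece of such frameworks; a minimal sketch follows, with an illustrative decay value.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.9996):
    """teacher <- decay * teacher + (1 - decay) * student, parameter-wise."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.data.mul_(decay).add_(s_p.data, alpha=1.0 - decay)
```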

(3) Mean teacher
The mean-teacher framework consists of a teacher model, whose weights are the EMA of the student's weights, and a student model that learns from targets generated by the teacher. For example, Soft Teacher proposed an end-to-end semi-supervised object detection method. To fully exploit the teacher model, the classification loss on unlabeled bounding boxes is weighted by classification scores produced by the teacher network. Furthermore, box regression is better learned by selecting as pseudo-labels only those candidate boxes whose box-regression variance falls below a threshold; both ideas are sketched below.
[Figure: Soft Teacher]
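Both ideas can be roughly sketched as follows. This is a simplification of Soft Teacher, which specifically reweights background candidates by the teacher's background score and estimates regression variance via box jittering; the threshold value here is illustrative.

```python
import torch
import torch.nn.functional as F

def weighted_cls_loss(student_logits, pseudo_labels, teacher_scores):
    """Per-box classification loss on unlabeled data, weighted by scores
    from the teacher network (simplified from Soft Teacher's reweighting)."""
    per_box = F.cross_entropy(student_logits, pseudo_labels, reduction="none")
    return (per_box * teacher_scores).sum() / teacher_scores.sum().clamp(min=1e-6)

def select_regression_pseudo_labels(boxes, box_variance, var_thresh=0.02):
    """Keep candidates whose box-regression variance is below a threshold."""
    return boxes[box_variance < var_thresh]
```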

Likewise, Instant-Teaching proposes a co-rectify scheme that applies instant pseudo-labeling with extended weak-strong data augmentation at each training iteration.
[Figure: Instant-Teaching]

(4) Soft labels
Unlike STAC, which uses hard labels, Humble Teacher applies soft labels. When the head performs class-dependent bounding-box regression, it takes as targets the teacher's predicted class-probability distribution and the box offsets for all possible classes. To provide richer information, Humble Teacher uses a large number of region proposals together with soft pseudo-labels as training targets for the student model. A distillation-style sketch follows the figure.
[Figure: Humble Teacher]
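The classification half of soft-label training can be sketched as a distillation loss between teacher and student class distributions on shared proposals; the box-offset targets are omitted, and the temperature value is illustrative.

```python
import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_logits, temperature=1.0):
    """Student matches the teacher's predicted class distribution
    (soft labels) instead of a one-hot hard label."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```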

(5) Dense guidance-based
Dense Learning proposes an adaptive filtering strategy and an aggregated teacher to generate stable and accurate pseudo-labels. In addition, an uncertainty-aware consistency regularization term between different scales and shuffled patches is adopted to improve the detector's generalization.
[Figure: Dense Learning]

This type of method also introduces a region-selection technique to highlight key information and suppress the noise carried by dense labels. To replace sparse pseudo-labels with more informative dense supervision, dense teacher guidance (DTG) proposes a novel "dense-to-dense" paradigm that integrates the teacher's dense predictions directly into student training. It also introduces inverse NMS clustering and rank matching, so that the student receives sufficient, informative, dense guidance from the teacher; a simplified sketch of such a dense objective follows.
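In much-simplified form, a dense-to-dense objective supervises the student's raw per-location predictions with the teacher's, skipping NMS and thresholding entirely. DTG's inverse NMS clustering and rank matching are omitted from this sketch.

```python
import torch.nn.functional as F

def dense_guidance_loss(student_cls_maps, teacher_cls_maps):
    """Dense supervision over per-location class predictions, e.g. one map
    per FPN level; the teacher's probabilities act as soft targets."""
    loss = 0.0
    for s_map, t_map in zip(student_cls_maps, teacher_cls_maps):
        loss = loss + F.binary_cross_entropy_with_logits(
            s_map, t_map.sigmoid(), reduction="mean")
    return loss
```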
(6) Point labeling
Point labels provide instance location information while saving annotation time. Omni-DETR utilizes different types of weak labels and generates accurate pseudo-labels through a filtering mechanism based on bipartite matching. Point DETR extends DETR by adding a point encoder. Building on the classic R-CNN architecture, Group R-CNN proposes instance-level proposal grouping and instance-level representation learning through instance-aware feature enhancement and instance-aware parameter generation. Its framework is as follows:
[Figure: Group R-CNN]

(7) Uncertainty quantification
Pseudo-labels inherently contain label noise, which introduces uncertainty into SSOD training. Some methods achieve noise-resistant learning by introducing region-level uncertainty quantification, using it as a soft target, and promoting multimodal probability distribution outputs.
[Figure: Combating noise]

To improve the filtering of predicted bounding boxes and give the student higher-quality training signals, NOTE-RCNN introduces an additional localization model, IL-Net, which uses lightweight branches to predict the intersection-over-union (IoU) quality of bounding boxes; a generic sketch appears after the figure. As shown in the figure below, given a large number of image-level labels and a small number of seed box-level annotations, the detector uses two classification heads and one distillation head to improve mining accuracy, masks the negative-sample loss, and trains the box-regression head only on seed annotations to eliminate the harm of inaccurate information.
[Figure: NOTE-RCNN]
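A localization-quality branch of this kind can be sketched generically as below. This is an assumption-laden illustration (the feature dimension and architecture are hypothetical), not the exact IL-Net design.

```python
import torch.nn as nn

class IoUPredictionBranch(nn.Module):
    """Lightweight head that predicts a box's IoU with the (unknown) ground
    truth from its pooled RoI features, as a localization-quality score."""

    def __init__(self, in_dim=1024, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1), nn.Sigmoid(),  # IoU lies in [0, 1]
        )

    def forward(self, roi_features):  # (N, in_dim) pooled RoI features
        return self.net(roi_features).squeeze(-1)
```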

(8) Data distillation
Some SSOD methods propose a self-distillation algorithm based on hint learning and ground-truth-bounded knowledge distillation to exploit purified data: predictions from multiple transformations of unlabeled data are ensembled by a single model, and the model is then retrained on the union of manually labeled and automatically labeled data.
(9) Vision-and-language-model-based
Most previous works only use a small labeled set to generate pseudo-labels, whereas vision-and-language (V&L) models can generate pseudo-labels for both known and unknown categories. VL-PLM starts from a general training strategy for object detectors on unlabeled data. To improve pseudo-label localization, it uses category-agnostic proposal scores and repeated application of the RoI head; in addition, cropped regions are scored by a vision-and-language model to produce better pseudo-labels. A sketch using CLIP as the V&L model follows the figure.
[Figure: VL-PLM]
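The V&L scoring step can be sketched with OpenAI's CLIP as the vision-and-language model. Note that VL-PLM additionally combines such scores with class-agnostic proposal scores, which is omitted here, and the prompt template is illustrative.

```python
import torch
import clip                      # OpenAI's CLIP, used here as the V&L model
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")
model.eval()

@torch.no_grad()
def score_crop(image: Image.Image, box, category_names):
    """Crop a proposal region and let CLIP score it against category names."""
    crop = preprocess(image.crop(tuple(box))).unsqueeze(0)
    text = clip.tokenize([f"a photo of a {c}" for c in category_names])
    img_feat = model.encode_image(crop)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).softmax(dim=-1)    # per-category scores
```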

PromptDet can detect novel categories without any manual annotation. As shown in the figure below, PromptDet has three stages; in the third stage, base and novel categories are fed to a self-training network, and regional prompt learning is used to generate more accurate pseudo-labels.
[Figure: PromptDet]

2. Consistency regularization
The second strategy is based on consistency regularization. As shown in the figure below, these methods regularize the outputs of the same unlabeled image to be consistent under different forms of data augmentation.
[Figure: Consistency-regularization-based SSOD framework]

CSD is a typical consistency-regularization-based semi-supervised object detection method that works with both single-stage and two-stage detectors. For single-stage detectors, the consistency loss for classification and localization is computed between corresponding spatial locations of the two images (an image and its horizontal flip). For two-stage detectors, the same set of RoIs is generated for both images through a shared RPN to extract features, and the consistency loss is then computed. A simplified flip-consistency sketch follows the figure.
[Figure: CSD]
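The flip-consistency idea can be sketched as follows. This is a simplification: CSD uses Jensen-Shannon divergence for classification consistency, approximated here with MSE, and the flipped outputs are assumed to be already spatially re-aligned to the original image.

```python
import torch
import torch.nn.functional as F

def flip_consistency_loss(cls_orig, cls_flip, loc_orig, loc_flip):
    """Consistency between an image and its horizontal flip. A flip mirrors
    boxes, so the x-offset of the flipped prediction changes sign."""
    cls_loss = F.mse_loss(cls_orig, cls_flip)
    loc_flip = loc_flip.clone()
    loc_flip[..., 0] = -loc_flip[..., 0]   # negate the x regression offset
    loc_loss = F.mse_loss(loc_orig, loc_flip)
    return cls_loss + loc_loss
```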

In consistency training, PseCo proposes multi-view scale-invariant learning, which includes both label-level and feature-level consistency mechanisms: feature consistency is achieved by aligning and shifting feature pyramids between two images that share content but differ in scale.
3. Graph-based approach
The third strategy is graph-based. Labeled and unlabeled data points are viewed as nodes of a graph, and the goal is to propagate labels from labeled nodes to unlabeled nodes according to node similarity, which is reflected by the strength of the edge between two nodes. This family is an important branch of semi-supervised learning in object tracking tasks, where it can effectively exploit the combined information of labeled and unlabeled samples: running a graph-based semi-supervised classification method independently on each graph improves tracking accuracy by exploiting the intrinsic structure of the mixed labeled/unlabeled sample set. A minimal label-propagation sketch follows.
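Below is a minimal label-propagation sketch in the style of classic graph-based semi-supervised learning (Zhou et al.'s label spreading); real graph-based trackers and detectors build far more elaborate graphs.

```python
import numpy as np

def label_propagation(W, Y, num_iters=50, alpha=0.99):
    """W: (n, n) affinity matrix over labeled + unlabeled nodes.
    Y: (n, c) one-hot rows for labeled nodes, zero rows for unlabeled ones.
    Labels flow along strong edges until the scores converge."""
    d = 1.0 / np.sqrt(W.sum(axis=1) + 1e-12)
    S = W * d[:, None] * d[None, :]     # symmetric normalization D^-1/2 W D^-1/2
    F_scores = Y.astype(float).copy()
    for _ in range(num_iters):
        F_scores = alpha * S @ F_scores + (1 - alpha) * Y
    return F_scores.argmax(axis=1)      # predicted class per node
```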
4. Transfer learning
Obtaining object-level annotations (category plus bounding box) is always harder than obtaining image-level annotations (category only). It is therefore worth exploring how to transfer knowledge from existing categories that have both image- and object-level annotations to categories without object-level annotations. The fourth strategy is based on transfer learning, as shown in the figure below: it learns the difference between the two tasks and transfers knowledge from a classifier trained on data without bounding-box annotations to a detector.
[Figure: Transfer-learning-based SSOD framework]

The figure below shows the large-scale detection through adaptation (LSDA) framework. The algorithm uses data with image-level annotations together with data with object-level annotations to learn a classifier, transforms that classifier into a detection network using the second type of data, and finally feeds all the data into the network to obtain an adapted detection network.
[Figure: LSDA]
Building on this, the LSDA line of work also finds that visual similarity and semantic relatedness are complementary for detection. As shown in the figure below, a similarity-based knowledge transfer model is proposed, which shows how knowledge of object similarity from the visual and semantic domains can be transferred to adapt an image classifier into an object detector in a semi-supervised setting.

[Figure: Similarity-based knowledge transfer]

Loss function

The designed loss has a large impact on what SSOD can learn from the data. In most SSOD methods, the overall loss is defined as the weighted sum of supervised loss and unsupervised loss, which can be expressed as follows:
$$L = L_s + \lambda L_u$$

where $L_s$ and $L_u$ denote the supervised loss on labeled images and the unsupervised loss on unlabeled images, respectively, and $\lambda$ controls the contribution of the unsupervised loss. Both terms include a classification loss and a regression loss; classification and localization are usually instantiated as a weighted sum of the standard cross-entropy loss and the smooth L1 loss.
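In code the combination is trivial; a sketch with an illustrative weight:

```python
def total_loss(sup_cls, sup_reg, unsup_cls, unsup_reg, unsup_weight=2.0):
    """Overall SSOD objective L = L_s + lambda * L_u, where each term sums a
    classification and a regression loss; the weight value is illustrative."""
    return (sup_cls + sup_reg) + unsup_weight * (unsup_cls + unsup_reg)
```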

Experimental results

The authors first summarize the AP of SOTA methods on the MS-COCO dataset, where 0.01, 0.02, 0.05, and 0.1 denote the fraction of labeled data:
[Table: AP of SOTA methods on MS-COCO under different labeled-data fractions]

Representative methods are then evaluated on additional benchmarks:

[Table: evaluation results]

Summary

Both supervised and semi-supervised algorithms can be applied to object detection. Supervised algorithms achieve good performance but have clear limitations: they must learn from large amounts of labeled data and place high demands on data quality. The semi-supervised algorithms introduced in this paper need only a small amount of labeled data plus a large amount of unlabeled data to improve model quality, saving annotation cost in practice. This paper presents a complete review of recently proposed semi-supervised object detection methods, classifying them according to their rationale and describing their advantages and disadvantages. However, semi-supervised object detection still faces many challenges:
(1) Are the pseudo-labels accurate?
Self-training-based methods provide flexibility for the further development of semi-supervised learning for object detection, showing that a single self-training model can learn representations from unlabeled data and build an intermediate labeling system to deal with label scarcity in the semi-supervised setting. However, confirmation bias and over-reliance on pseudo-labels have been overlooked. Subsequent methods need to explore how to use unlabeled data more effectively, alleviate confirmation bias, and improve pseudo-label quality.
(2) Labeling form
Pseudo-labeling has proven effective in SSOD, achieving SOTA results on benchmarks such as MS-COCO and Pascal VOC. However, generating pseudo-labels requires several additional steps, such as NMS, thresholding, and label assignment. Dense-guidance-based approaches are seminal work taking a first step toward a simpler and more effective form of pseudo-labeling; soft labels coupled with a balanced number of teacher region proposals are key to their superior performance. Point-label-based methods also achieve a better cost-accuracy trade-off. Therefore, combining multiple label forms is useful for obtaining more detailed information from samples.
(3) Class balance
Current SSOD methods effectively improve detection accuracy on balanced data and robustness to noisy samples. However, training requires a large amount of balanced labeled data, which is difficult to obtain in real-world scenarios. Considering class imbalance in unlabeled images and developing more resource-friendly semi-supervised object detection methods are meaningful future directions.

Origin blog.csdn.net/limingmin2020/article/details/132422257