Towards Open World Object Detection

Summary

question

The authors propose a new problem setting, Open World Object Detection, defined as follows:

  1. Objects of unknown categories may appear in the test set, and the network must recognize them as unknown;
  2. If labels for the unknown categories are provided later, the network must be able to learn the new categories incrementally.

Related fields:

Open Set Classification: Works in this field can identify unknown categories, but they cannot update themselves incrementally over multiple training episodes.
Open World Classification: Models in this field can recognize unknown objects and update themselves once labels for the unknowns are provided, but these works have not been tested on image classification benchmarks.
Open Set Detection: Research in this field has found that traditional object detectors often misclassify objects of unknown categories as known categories. The usual remedies are to (1) add a background category or (2) filter out objects of unknown categories, but neither works in a truly dynamic environment.

significance

Object detection is a relatively mature technology, but it still lacks one ability that human vision has: recognizing every object in the real world, including ones never seen before, and gradually learning to recognize new objects. Advances in open-set and open-world image classification cannot simply be transferred to object detection, because of one key difference: when an object detector is trained, unknown objects are treated as background. Instances of many unknown classes appear in the training images alongside the known objects, and since they are not labeled, the detector learns them as background.
This paper proposes a new open-world object detector (ORE) based on contrastive clustering and energy-based unknown identification. Open-world object detection is a new problem: a model should identify instances of unknown objects as "unknown" in a general way, and then learn to recognize them as the corresponding training data progressively becomes available.

Strengths

network structure

[Figure: network structure of ORE]

general idea

Define the set of known categories at time $t$ as $K^t = \{1, 2, \dots, C\} \subset \mathbb{N}^+$, where $\mathbb{N}^+$ denotes the set of positive integers. To model the dynamic real world, an additional set of unknown categories $U = \{C+1, \dots\}$ is defined. The known categories are labeled in the dataset $D^t = \{X^t, Y^t\}$, where the image set $X^t = \{I_1, \dots, I_M\}$ contains $M$ images and $Y^t = \{Y_1, \dots, Y_M\}$ holds the corresponding labels. Each label $Y_i = \{y_1, \dots, y_K\}$ contains the ground truth of the $K$ objects in image $I_i$, where $y_k = [l_k, x_k, y_k, w_k, h_k]$ with class label $l_k \in K^t$ and bounding box $(x_k, y_k, w_k, h_k)$.

The task proceeds as follows. The model $M_C$ recognizes the $C$ known categories and flags everything else as unknown (label 0). Then $n$ categories of interest are labeled from the detected unknowns $U^t$ and the corresponding training samples are provided. The network learns these $n$ categories incrementally, yielding $M_{C+n}$, and the known set becomes $K^{t+1} = K^t \cup \{C+1, \dots, C+n\}$.
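The label bookkeeping in this protocol can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the function name `extend_known_classes` is hypothetical, and the known labels are assumed to be the contiguous set {1, ..., C} with 0 reserved for "unknown".

```python
UNKNOWN_LABEL = 0  # label reserved for "unknown", as in the description above


def extend_known_classes(known, n):
    """Return K^{t+1} = K^t union {C+1, ..., C+n}, where C = |K^t|.

    `known` is assumed to be exactly the contiguous label set {1, ..., C}.
    """
    c = len(known)
    if known != set(range(1, c + 1)):
        raise ValueError("known classes must be exactly {1, ..., C}")
    return known | set(range(c + 1, c + n + 1))


known_t = {1, 2, 3}                          # C = 3 known classes at time t
known_t1 = extend_known_classes(known_t, 2)  # two new classes get labeled
print(sorted(known_t1))                      # [1, 2, 3, 4, 5]
```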

Innovation

The method builds on Faster R-CNN, with the following improvements.

contrastive clustering

Idea: In the latent space, features of the same category should lie close together and features of different categories should lie farther apart. Based on this expectation, a clustering constraint is added to the features extracted by the backbone network.
Implementation: For each category, the mean of the latent features of samples seen over a period of training iterations serves as the cluster center. Features of samples in that class are constrained to lie close to the center, while features of other classes are pushed away from it. The cluster centers are updated continually during training.
Advantages: 1. It helps the network separate the representations of unknown categories from those of known categories. 2. It encourages the network to learn representations of unknown categories without overwriting the existing category representations in the latent space.
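The clustering constraint described above can be sketched as a hinge-style loss plus a running update of the centers. This is a minimal sketch, not the paper's implementation: Euclidean distance, the `margin` value, and the `momentum` value are all assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(feature, label, centers, margin=10.0):
    """Pull `feature` toward its own class center and push it at least
    `margin` away from every other class center (hinge form)."""
    loss = 0.0
    for cls, center in centers.items():
        d = euclidean(feature, center)
        if cls == label:
            loss += d                      # same class: minimize distance
        else:
            loss += max(0.0, margin - d)   # other class: penalize if too close
    return loss


def update_center(old, recent_mean, momentum=0.99):
    """Blend the stored center with the mean of recently seen features."""
    return [momentum * o + (1.0 - momentum) * m for o, m in zip(old, recent_mean)]


centers = {1: [0.0, 0.0], 2: [3.0, 4.0]}
print(contrastive_loss([0.0, 0.0], 1, centers))              # 5.0
print(contrastive_loss([0.0, 0.0], 1, centers, margin=4.0))  # 0.0
```

In the first call the feature sits exactly on its own center but only 5 units from the other one, so the margin of 10 is violated by 5. With a margin of 4 the other center is already far enough away and the loss vanishes.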

Using RPN to automatically label unknown categories

Idea: In practice, objects of unknown categories cannot be labeled in the training set in advance. The proposals generated by the RPN only distinguish foreground from background, so a background proposal is really just a region without an annotation. The authors argue that background proposals with high objectness scores are likely to be unlabeled objects, and therefore take the top_k background proposals, sorted by score, as unknown-category targets.
Advantages: The RPN can be used to label unknown objects in an image automatically.
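The selection step can be sketched as follows. This is an illustrative sketch under assumptions rather than the authors' code: boxes are `(x1, y1, x2, y2)` tuples, a proposal counts as background when its IoU with every ground-truth box is below a hypothetical threshold `bg_iou`, and the function names are made up for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pseudo_label_unknowns(proposals, gt_boxes, top_k=1, bg_iou=0.3):
    """proposals: list of (box, objectness_score) pairs.

    Keeps proposals whose IoU with every ground-truth box is below `bg_iou`
    (treated as background) and returns the `top_k` highest-scoring boxes.
    """
    background = [
        (box, score)
        for box, score in proposals
        if all(iou(box, gt) < bg_iou for gt in gt_boxes)
    ]
    background.sort(key=lambda pair: pair[1], reverse=True)
    return [box for box, _ in background[:top_k]]


gt = [(0, 0, 10, 10)]
proposals = [
    ((0, 0, 10, 10), 0.99),    # matches the labeled object
    ((1, 1, 10, 10), 0.90),    # overlaps the labeled object heavily
    ((20, 20, 30, 30), 0.80),  # unlabeled region with high objectness
    ((40, 40, 50, 50), 0.60),  # unlabeled region with lower objectness
]
print(pseudo_label_unknowns(proposals, gt, top_k=1))  # [(20, 20, 30, 30)]
```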

Recognition and Correction of Unknown Classes Based on Energy Models

Idea: This part builds on energy-based models (EBMs). Given a feature $f \in F$ in the latent space $F$ and its corresponding label $l \in L$, the goal is to learn an energy function $E(\cdot)$ whose scalar output estimates the compatibility between the observed variable $F$ and the set of possible output variables $L$, namely $E(f): \mathbb{R}^d \rightarrow \mathbb{R}$.
EBMs assign low energy to in-distribution data and high energy to everything else, so the energy value indicates whether a sample belongs to an unknown category.
Advantages: The classification head of a standard Faster R-CNN can be converted into an energy function. During inference, the energy of each sample is computed and its predicted category is corrected accordingly.
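One common way to turn classification logits into an energy is the free-energy form $E(f) = -T \log \sum_i \exp(g_i(f)/T)$, where $g_i(f)$ is the logit for class $i$ and $T$ a temperature. The sketch below assumes this form; the single fixed `threshold` is a hypothetical stand-in for however the known and unknown energy distributions are actually separated.

```python
import math


def free_energy(logits, temperature=1.0):
    """E(f) = -T * log(sum_i exp(g_i(f) / T)), computed in a stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    log_sum = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * log_sum


def is_unknown(logits, threshold, temperature=1.0):
    """Flag a sample as unknown when its energy exceeds `threshold`."""
    return free_energy(logits, temperature) > threshold


confident = [10.0, 0.0, 0.0]  # one class clearly dominates: low energy
uncertain = [0.1, 0.0, 0.0]   # no class stands out: higher energy
print(free_energy(confident) < free_energy(uncertain))  # True
print(is_unknown(uncertain, threshold=-5.0))            # True
print(is_unknown(confident, threshold=-5.0))            # False
```

A confident prediction has a large log-sum-exp and therefore a strongly negative energy, which matches the statement that in-distribution samples receive low energy.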

Weaknesses and Improvements

anchor-based

The model relies heavily on dense candidate boxes: $k$ anchor boxes are placed at every location of the $H \times W$ feature map, which yields thousands of anchors ($H \times W \times k$) and is inefficient.
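To make the scale concrete, the anchor count can be computed directly. The feature-map size and $k$ below are hypothetical values chosen only for illustration.

```python
def anchor_count(height, width, anchors_per_location):
    """Number of anchors on an H x W feature map with k anchors per location."""
    return height * width * anchors_per_location


# Hypothetical example: a 50 x 38 feature map with k = 9 anchors per location.
print(anchor_count(50, 38, 9))  # 17100
```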
Improvement:
Replace Faster R-CNN with Sparse R-CNN as the underlying detector.
The input consists of an image, a set of proposal boxes, and a set of proposal features, where the latter two are learnable parameters. The backbone extracts feature maps, each proposal box and its proposal feature are fed into that proposal's own dynamic head to generate object features, and the network finally outputs classification and localization.
Network formula:
image + proposal boxes --(fusion)--> RoIs
RoIs + proposal features --(fusion)--(fully connected layer)--> predictions
Learnable proposal box:
A fixed set of learnable proposal boxes ($N \times 4$) provides the region proposals, replacing the RPN.
Each box is a four-dimensional parameter with values from 0 to 1, representing the normalized center coordinates $(x, y)$, height, and width, and is updated by backpropagation.
The boxes can be viewed as statistics of potential object locations in the training set, i.e., an initial guess of the regions most likely to contain objects, independent of the input content.
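As a small illustration of this parameterization, the sketch below converts one normalized proposal box into pixel corner coordinates. The `(cx, cy, h, w)` ordering follows the description above; the function name and the whole-image starting box are assumptions made for the example.

```python
def to_pixel_corners(box, image_width, image_height):
    """Convert a normalized (cx, cy, h, w) box, all values in [0, 1],
    into pixel corner coordinates (x1, y1, x2, y2)."""
    cx, cy, h, w = box
    x1 = (cx - w / 2) * image_width
    y1 = (cy - h / 2) * image_height
    x2 = (cx + w / 2) * image_width
    y2 = (cy + h / 2) * image_height
    return (x1, y1, x2, y2)


# Hypothetical initial guess: a box covering the whole image.
whole_image = (0.5, 0.5, 1.0, 1.0)
print(to_pixel_corners(whole_image, 640, 480))  # (0.0, 0.0, 640.0, 480.0)
```

Because the four values are ordinary parameters, gradients from the detection loss can move each box toward regions where objects tend to occur.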

Learnable proposal feature:
Proposal boxes only provide a rough localization of objects and lose much detailed information such as object pose and shape. Therefore, proposal features ($N \times d$) are introduced: each is a high-dimensional vector (e.g., $d = 256$) that encodes instance characteristics.
There are as many proposal features as proposal boxes.

Dynamic instance interactive head:
Given $N$ proposal boxes, Sparse R-CNN first uses the RoIAlign operation to extract features for each box. Each box feature is then passed through its prediction head to produce the final prediction.

generate unknown class

Automatically generating unknown-category labels implicitly assumes that objects of the unknown categories already appear in the training set but are unlabeled. Only then are the extracted unknown proposals meaningful. In many scenarios, however, unknown categories do not appear in the training set in advance (if they did, they would be labeled) and only emerge later, which makes this approach risky in practical applications.
Improvement:
Train a detection network with strong generalization ability, ideally one that generalizes to targets that never appeared in the training set but resemble the training targets; and train a classification network with anomaly detection ability, i.e., one that automatically discovers when a target is inconsistent with every known target pattern.

contrastive clustering

When implementing contrastive clustering, the paper also clusters the unknown categories together. However, unknowns may be diverse and do not actually belong to one category, so in theory this clustering may have side effects; this deserves further experimental verification.
Improvement:
Perform a preliminary classification of the unknown categories, or replace the current clustering method with a better one.

Origin blog.csdn.net/dawnyi_yang/article/details/125222529