DETR、conditional DETR、Deformable DETR

DETR is a model that applies the transformer to the field of object detection. Its main idea is to use the transformer's encoder-decoder architecture and attention mechanism to achieve end-to-end object detection.

DETR

[Figure: overall architecture of DETR]
As shown in the figure, the DETR model is mainly composed of a backbone, a transformer block, and prediction heads.
Backbone
As the preliminary feature extractor for the input image, the backbone is a convolutional neural network such as ResNet-50. For an input image of shape B×3×H×W, where B is the batch size and H and W are the height and width of the image, the features extracted by this CNN are projected to the hidden dimension (a 1×1 convolution, equivalent to a per-pixel fully connected layer) and flattened to obtain a feature map of shape B×HW×D, where D is a hyperparameter denoting the hidden dimension (here HW is the spatial size of the downsampled CNN feature map).
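The backbone step above can be sketched in NumPy. The projection weights below are random placeholders rather than trained parameters, and the 1×1 convolution is implemented as a per-pixel matrix multiply:

```python
import numpy as np

def backbone_to_tokens(feats, d_model, rng=None):
    # feats: CNN output of shape (B, C, h, w). A 1x1 convolution
    # (equivalently a per-pixel fully connected layer) projects the
    # channel dimension C -> d_model, then the spatial grid is
    # flattened to a sequence of h*w tokens.
    rng = np.random.default_rng(0) if rng is None else rng
    B, C, h, w = feats.shape
    W_proj = rng.standard_normal((C, d_model)) / np.sqrt(C)  # placeholder weights
    tokens = feats.reshape(B, C, h * w).transpose(0, 2, 1)   # (B, h*w, C)
    return tokens @ W_proj                                   # (B, h*w, d_model)

# e.g. a ResNet-50 final feature map for a 2-image batch
tokens = backbone_to_tokens(np.zeros((2, 2048, 25, 34)), d_model=256)
print(tokens.shape)  # (2, 850, 256)
```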
Transformer block
DETR uses both the encoder and the decoder of the transformer. In the encoder, the features extracted by the backbone, with the positional embedding added (B×HW×D), serve as the queries and keys, and the features themselves serve as the values; this is the same self-attention mechanism as in the standard transformer encoder. The decoder module contains two parts: a self-attention mechanism and a cross-attention mechanism. Here DETR introduces the concept of object queries (N×D, where N is a hyperparameter, set to 100 in the paper). The object queries play a role somewhat like anchors: they represent the geometric information of N boxes that may contain detection targets. In the self-attention part, the object queries are added to both the queries and the keys, and the output of this part is Cq. The decoder's cross-attention module uses the encoder output Ck with the positional embedding added as keys, Ck itself as values, and Cq with the object queries added as queries; after cross-attention it outputs a tensor of size B×N×D.
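A minimal single-head sketch of the decoder cross-attention just described (projection matrices and the multi-head split are omitted for brevity; tensor names follow the text):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decoder_cross_attention(Cq, obj_query, Ck, pos_embed):
    # Cq: (B, N, D) output of decoder self-attention; obj_query: (N, D);
    # Ck: (B, HW, D) encoder output; pos_embed: (HW, D).
    q = Cq + obj_query                 # queries = Cq + object queries
    k = Ck + pos_embed                 # keys = encoder output + positional embedding
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]))
    return attn @ Ck                   # values = encoder output -> (B, N, D)

B, N, HW, D = 2, 100, 850, 256
out = decoder_cross_attention(np.zeros((B, N, D)), np.zeros((N, D)),
                              np.zeros((B, HW, D)), np.zeros((HW, D)))
print(out.shape)  # (2, 100, 256)
```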
Prediction head
We use the output of the transformer as the input to the prediction head to predict the detection results. The N predicted boxes and the ground-truth boxes are matched by maximum bipartite matching, where N is usually significantly larger than the number of ground-truth boxes; predictions that fail to match are assigned the label 'no object'. The model's loss function is then computed from this matching result. The loss consists of two parts: a classification loss and a box loss. The classification loss is computed for all predicted boxes, mainly using cross-entropy; the box loss is computed only for the successfully matched predicted boxes.
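The bipartite matching can be illustrated with SciPy's `linear_sum_assignment`, which is the solver the official DETR matcher uses. The cost values below are illustrative numbers only, not a real class/box cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Cost matrix between N_pred predictions and N_gt ground-truth boxes.
# DETR builds this from class probability, box L1 distance, and GIoU;
# here the entries are made-up numbers for illustration.
cost = np.array([[1.0, 5.0],   # pred 0 vs gt 0 / gt 1
                 [4.0, 1.0],   # pred 1
                 [3.0, 2.0]])  # pred 2
pred_idx, gt_idx = linear_sum_assignment(cost)  # minimizes total cost
no_object = [i for i in range(cost.shape[0]) if i not in set(pred_idx)]
print(list(zip(pred_idx, gt_idx)))  # matched (prediction, ground truth) pairs
print(no_object)                    # unmatched predictions -> 'no object'
```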

The implementation of DETR proceeds roughly as described above, but DETR converges slowly and is slow to train. To address this point, conditional DETR and Deformable DETR were derived.

conditional DETR

Part of the main reason for DETR's slow convergence is that it relies heavily on the content-embedding part of the query.
The model structure of conditional DETR is almost the same as DETR; its main improvement lies in the decoder.
[Figure: decoder structure of conditional DETR]
Conditional DETR decouples the query into two parts: spatial information and content information. In the self-attention module, the decoder embedding in the figure is the output of the previous decoder layer, and its sum with the object query serves as the queries and keys of self-attention. The main change of the model is in the cross-attention part. First, the output of the decoder self-attention is used as the content query. To obtain the spatial part, a reference point is introduced: s in the figure denotes the center coordinates of the predicted box; s can be a learnable parameter, or it can be obtained by mapping the object query. Meanwhile, the output of the previous decoder layer contains part of the positional information of the predicted box (the paper argues that the width and height relative to the reference point, and the offsets of the position coordinates, can be recovered from it), so an FFN (linear + ReLU + linear) extracts this feature, which is then multiplied element-wise with ps (the positional embedding of s) to obtain the final spatial query. The content query and the spatial query are concatenated as the final query of the cross-attention module. The encoder output Ck serves as the content keys and the positional embedding as the spatial keys; the two are concatenated as the keys of the cross-attention mechanism, with Ck as the values. The rest of the structure remains unchanged from DETR.
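The construction of the final cross-attention query can be sketched as follows. The FFN weights are random placeholders, and `ps` stands for the positional embedding of the reference point s:

```python
import numpy as np

def conditional_cross_attn_query(content_q, decoder_out, ps, rng=None):
    # content_q: (B, N, D) output of decoder self-attention;
    # decoder_out: (B, N, D) previous decoder layer output;
    # ps: (B, N, D) positional embedding of the reference point s.
    # The FFN (linear + ReLU + linear) uses placeholder random weights.
    rng = np.random.default_rng(0) if rng is None else rng
    D = content_q.shape[-1]
    W1 = rng.standard_normal((D, D)) / np.sqrt(D)
    W2 = rng.standard_normal((D, D)) / np.sqrt(D)
    T = np.maximum(decoder_out @ W1, 0.0) @ W2   # FFN output
    spatial_q = T * ps                           # element-wise modulation of ps
    # concat content and spatial parts -> (B, N, 2D)
    return np.concatenate([content_q, spatial_q], axis=-1)

q = conditional_cross_attn_query(np.zeros((2, 100, 256)),
                                 np.zeros((2, 100, 256)),
                                 np.zeros((2, 100, 256)))
print(q.shape)  # (2, 100, 512)
```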

All in all, the core mechanism of conditional DETR is to learn a spatial query from the decoder embedding and the object query. This query helps the model localize the targets to be detected more precisely, thereby greatly improving convergence and training speed.

Deformable DETR

Another possible reason for DETR's slow convergence is that, when the decoder's cross-attention predicts the detection boxes, it must compute attention coefficients against all elements of the entire feature map, which increases the amount of computation and slows down training. At the same time, this heavy computation forces the use of low-resolution feature maps, so a lot of image information is lost, which is why DETR performs unsatisfactorily on small-object detection. In view of this, Deformable DETR was proposed, and it can largely solve these two problems.
[Figure: deformable attention module]
To reduce the amount of computation when calculating the attention coefficients, the model introduces a hyperparameter K, meaning that attention coefficients are computed only between the current position and K sampled positions; in the figure above, K is 3. First, the formula for the conventional multi-head attention mechanism:
MultiHeadAttn(z_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k∈Ω_k} A_mqk · W'_m x_k ]

In the formula, W'_m is the projection matrix that maps features to values, and A_mqk is the attention coefficient, usually obtained from the dot product of the query and key matrices. The deformable multi-head attention mechanism in this model is computed as follows:
DeformAttn(z_q, p_q, x) = Σ_{m=1}^{M} W_m [ Σ_{k=1}^{K} A_mqk · W'_m x(p_q + Δp_mqk) ]
Compared with the ordinary multi-head attention formula, deformable DETR makes two main changes. First, attention coefficients are not computed against the pixels at all positions, but only against the pixel features at K offset positions around the reference point (the offset vectors are obtained by mapping the query features). Second, A_mqk is not computed from queries and keys; it is obtained directly by a linear mapping of the query features.
In practice, the query features are fed into a linear layer with 3MK output channels, where M is the number of heads: the first 2MK channels give the (x, y) coordinates of the sampling offsets, and the remaining MK channels give A_mqk (normalized with a softmax over the K samples).
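This offset/weight prediction can be sketched as follows; the linear-layer weights are random placeholders:

```python
import numpy as np

def deformable_attn_params(query, M, K, rng=None):
    # query: (B, N, D) query features. A single linear layer with 3*M*K
    # output channels yields, per head m and sample k, a 2-D sampling
    # offset (first 2*M*K channels) and an attention weight A_mqk
    # (remaining M*K channels), softmax-normalized over the K samples.
    rng = np.random.default_rng(0) if rng is None else rng
    B, N, D = query.shape
    W = rng.standard_normal((D, 3 * M * K)) / np.sqrt(D)   # placeholder weights
    out = query @ W                                        # (B, N, 3MK)
    offsets = out[..., :2 * M * K].reshape(B, N, M, K, 2)  # (dx, dy) per sample
    logits = out[..., 2 * M * K:].reshape(B, N, M, K)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)                  # sums to 1 over K
    return offsets, A

offsets, A = deformable_attn_params(np.ones((2, 100, 256)), M=8, K=4)
print(offsets.shape, A.shape)  # (2, 100, 8, 4, 2) (2, 100, 8, 4)
```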

In addition, Deformable DETR also adds multi-scale features, where l denotes the scale level of the feature map.

[Figure: multi-scale deformable attention]
The complete process diagram of Deformable DETR is shown below.
[Figure: complete pipeline of Deformable DETR]

Origin: blog.csdn.net/qq_45836365/article/details/127982053