End-to-End Object Detection with Transformers (DETR): paper reading notes

Paper title: End-to-End Object Detection with Transformers
Paper link: https://arxiv.org/abs/2005.12872

Summary:

  A new method is proposed that treats object detection directly as a set prediction problem (in fact, the proposal-, anchor-, and window-center-based methods are all essentially set prediction too, just with heavy manual intervention based on prior knowledge, such as NMS). DETR, by contrast, is purely end-to-end: training requires no hand-designed components. DETR's steps are: (1) a CNN backbone extracts features; (2) a transformer encoder learns global features; (3) a transformer decoder generates the prediction boxes (100 of them); (4) the prediction boxes are matched against the GT boxes. Step (4) is not needed at inference: once the prediction boxes are generated, they are filtered by a confidence threshold and then output.
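For concreteness, a minimal sketch of that inference-time filtering step in PyTorch (the 92-way logits follow DETR's COCO setup of 91 classes plus a trailing no-object class; the 0.7 threshold is an illustrative assumption, not a value fixed by the paper):

```python
import torch

# Hypothetical decoder outputs for one image: 100 object queries and
# 92 logits each (91 COCO classes + 1 trailing "no object" class).
pred_logits = torch.randn(100, 92)
pred_boxes = torch.rand(100, 4)          # normalized (cx, cy, w, h)

probs = pred_logits.softmax(-1)[:, :-1]  # drop the no-object column
scores, labels = probs.max(-1)           # best real class per query
keep = scores > 0.7                      # assumed confidence threshold
boxes, labels = pred_boxes[keep], labels[keep]
```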

Introduction

  DETR adopts a transformer-based encoder-decoder architecture and uses the self-attention mechanism to explicitly model the interactions between all elements in a sequence. The benefit is that the redundant boxes produced in object detection can be eliminated! DETR predicts all objects at once and is trained end-to-end with a set loss function that performs bipartite matching between predictions and ground truth. (For example, if 100 boxes are generated and there are 10 GT boxes, the 10 predictions that best match the GT are treated as positive samples and the remaining 90 as negatives, so NMS is no longer needed.)

Related work

The DETR work builds on: (1) bipartite matching losses for set prediction, (2) transformer encoder-decoder architectures, (3) parallel decoding, and (4) prior object detection methods.

(1) Set prediction

At present, most detectors need NMS post-processing to remove redundant boxes. Direct set prediction, by contrast, needs no post-processing: global inference that models the interactions between all predicted elements avoids redundancy in the first place. For fixed-size set prediction, an MLP can work (brute force: score every prediction-GT pair), but the cost is high. The usual solution is to design a loss based on the Hungarian algorithm to find a bipartite matching between ground-truth and predicted elements.
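The assignment problem this solves maps directly onto SciPy's Hungarian-algorithm implementation; a tiny demo (the cost values are made up):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows = workers (A, B, C); columns = tasks; entries = assignment costs.
cost = np.array([[1.0, 4.0, 5.0],
                 [2.0, 2.0, 6.0],
                 [3.0, 3.0, 3.0]])
workers, tasks = linear_sum_assignment(cost)   # Hungarian algorithm
print(list(zip(workers, tasks)))               # optimal one-to-one assignment
print(cost[workers, tasks].sum())              # minimum total cost
```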

(2) Transformer structure

The attention mechanism can aggregate information from the entire input sequence. One of the main advantages of self-attention-based models is their global computation and perfect memory. (Impressive!)

(3) Object detection

Most object detectors make predictions relative to proposals, anchors, or window centers, but because many redundant boxes are generated, NMS has to be used. DETR removes the need for NMS and achieves true end-to-end detection.

DETR model

In object detection models, two ingredients are essential for direct set prediction:
(1) a set prediction loss that enforces a unique matching between predictions and GTs;
(2) an architecture that predicts a set of objects in a single pass and models the relations between them.

Set prediction loss

In a single decoder pass, DETR infers a fixed-size set of N predictions, where N is set significantly larger than the typical number of objects in an image. One of the main training difficulties is scoring the predicted objects (class, position, size) against the ground truth. The paper first finds an optimal bipartite matching between predicted objects and GTs, and then optimizes the object-specific (bounding box) losses.

  1. The first step is to find the lowest-cost bipartite matching between predicted boxes and ground-truth boxes (analogy: 3 workers and 3 tasks, each worker suited to each task at a different cost; how do we assign workers to tasks at minimum total cost?). We search for the permutation of N elements that minimizes the total matching cost, i.e. an optimal assignment.
    Previous work solves this assignment problem with the Hungarian algorithm.

    Paper: each element $i$ of the ground-truth set can be viewed as $y_i = (c_i, b_i)$, where $c_i$ is the target class label (note that it may be $\varnothing$) and $b_i \in [0, 1]^4$ is a vector that defines the box center coordinates and its height and width relative to the image size.
    $$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i}^{N} \mathcal{L}_{\mathrm{match}}\big(y_i, \hat{y}_{\sigma(i)}\big), \qquad \mathcal{L}_{\mathrm{match}}\big(y_i, \hat{y}_{\sigma(i)}\big) = -\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\big(b_i, \hat{b}_{\sigma(i)}\big)$$
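A simplified sketch of building this cost matrix and solving the matching for one image (loosely following the spirit of DETR's matcher; the real matcher also adds a GIoU term to the box cost, omitted here for brevity):

```python
import torch
from scipy.optimize import linear_sum_assignment

def match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """pred_logits: (N, C+1); pred_boxes: (N, 4); gt_labels: (M,); gt_boxes: (M, 4)."""
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, gt_labels]                    # (N, M): -p_hat(c_i) per GT
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # (N, M): L1 box distance
    cost = cost_class + cost_bbox                       # real DETR adds a GIoU cost too
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)
```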
  2. The second step is to compute the loss function: the Hungarian loss over all pairs matched in the first step. The loss is defined similarly to those of common object detectors, i.e. a linear combination of the negative log-likelihood for class prediction and the box loss $\mathcal{L}_{\mathrm{box}}$ defined below:

$$\mathcal{L}_{\mathrm{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \Big[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\big(b_i, \hat{b}_{\hat{\sigma}(i)}\big) \Big]$$
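Given the matching, this amounts to a cross-entropy over all N slots (unmatched slots get the no-object label, which the paper down-weights by a factor of 10) plus a box loss on matched pairs; a hedged sketch reusing the `match` helper above:

```python
import torch
import torch.nn.functional as F

def hungarian_loss(pred_logits, pred_boxes, gt_labels, gt_boxes, no_object_id):
    pred_idx, gt_idx = match(pred_logits, pred_boxes, gt_labels, gt_boxes)
    # Every slot defaults to the no-object class; matched slots get their GT class.
    target = torch.full((pred_logits.size(0),), no_object_id, dtype=torch.long)
    target[pred_idx] = gt_labels[gt_idx]
    loss_cls = F.cross_entropy(pred_logits, target)       # paper down-weights ∅ by 10x
    loss_box = F.l1_loss(pred_boxes[pred_idx], gt_boxes[gt_idx])  # paper adds GIoU too
    return loss_cls + loss_box
```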
3. Bounding box loss:
The second part of the matching cost and the Hungarian loss scores the bounding boxes. Unlike many detectors that predict boxes as offsets with respect to some initial guess, DETR makes box predictions directly. While this simplifies the implementation, it poses an issue of relative scaling of the loss: the commonly used L1 loss has different scales for small and large boxes even when their relative errors are similar. To mitigate this, a linear combination of the L1 loss and the scale-invariant generalized IoU loss is used:
$$\mathcal{L}_{\mathrm{box}}\big(b_i, \hat{b}_{\sigma(i)}\big) = \lambda_{\mathrm{iou}}\, \mathcal{L}_{\mathrm{iou}}\big(b_i, \hat{b}_{\sigma(i)}\big) + \lambda_{\mathrm{L1}}\, \big\lVert b_i - \hat{b}_{\sigma(i)} \big\rVert_1$$
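For reference, a sketch of the generalized IoU term for axis-aligned boxes in (x1, y1, x2, y2) format (DETR's (cx, cy, w, h) outputs are converted to corners before this computation), following the standard GIoU definition:

```python
import torch

def generalized_iou(a, b):
    """a, b: (..., 4) boxes as (x1, y1, x2, y2). Returns GIoU in [-1, 1]."""
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    # Intersection and union
    lt = torch.max(a[..., :2], b[..., :2])
    rb = torch.min(a[..., 2:], b[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest box enclosing both; GIoU penalizes its empty area
    lt_c = torch.min(a[..., :2], b[..., :2])
    rb_c = torch.max(a[..., 2:], b[..., 2:])
    enclose = (rb_c - lt_c)[..., 0] * (rb_c - lt_c)[..., 1]
    return iou - (enclose - union) / enclose
```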

DETR framework


Backbone:

Input image: 3 × H × W. A conventional CNN backbone produces a low-resolution feature map of shape (2048, H/32, W/32).
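A quick sketch of this stage with torchvision's ResNet-50 (the paper's default backbone); chopping off the global pooling and classification head is one straightforward way to expose that feature map:

```python
import torch
import torchvision

# ResNet-50 up to (but not including) global pooling and the FC head.
resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

img = torch.randn(1, 3, 800, 1216)   # batch of 1; H0 = 800, W0 = 1216
feat = backbone(img)                 # (1, 2048, 25, 38) = (2048, H0/32, W0/32)
```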

Transformer encoder:

First, a 1×1 convolution reduces the channel dimension from 2048 to 256, giving a (256, H/32, W/32) feature map, which is then flattened into a sequence so it can serve as transformer input. Each encoder layer has the standard structure: a multi-head self-attention module and an FFN (MLP). Since the transformer architecture is permutation-invariant, it is supplemented with fixed positional encodings, which are added to the input of each attention layer.
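A minimal sketch of the projection and flattening (stock `nn.TransformerEncoder` is used here for brevity, and a random tensor stands in for the paper's fixed sine positional encoding; note DETR re-adds the encoding at every attention layer, whereas the stock module only takes it at the input):

```python
import torch
import torch.nn as nn

d_model = 256
proj = nn.Conv2d(2048, d_model, kernel_size=1)   # 1x1 conv: 2048 -> 256 channels
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
encoder = nn.TransformerEncoder(layer, num_layers=6)

feat = torch.randn(1, 2048, 25, 38)              # backbone output (batch of 1)
x = proj(feat).flatten(2).permute(2, 0, 1)       # (HW, batch, d) = (950, 1, 256)
pos = torch.randn(x.size(0), 1, d_model)         # stand-in positional encoding
memory = encoder(x + pos)                        # (950, 1, 256) global features
```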

Transformer decoder:

The decoder follows the standard transformer architecture, transforming N embeddings of size d. Unlike the original transformer, DETR decodes the N objects in parallel at each decoder layer. Since the decoder is also permutation-invariant, the N input embeddings must be different to produce different results; these input embeddings are learned positional encodings called object queries. As in the encoder, they are added to the input of each attention layer. The output embeddings are then independently decoded by an FFN into box coordinates and class labels, yielding the N final predictions. Through self-attention and encoder-decoder attention over these embeddings, the model globally reasons about all objects together via their pairwise relations, while being able to use the whole image as context!
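And a matching sketch of the decoder side (N = 100 object queries as in the paper; stock `nn.TransformerDecoder` again, which only adds the queries at the input rather than at every layer as DETR does):

```python
import torch
import torch.nn as nn

d_model, num_queries = 256, 100
layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)
decoder = nn.TransformerDecoder(layer, num_layers=6)

# Object queries: learned positional encodings, one per prediction slot.
query_embed = nn.Embedding(num_queries, d_model)
tgt = torch.zeros(num_queries, 1, d_model)            # decoder input starts at zero
memory = torch.randn(950, 1, d_model)                 # encoder output from above
hs = decoder(tgt + query_embed.weight.unsqueeze(1), memory)  # (100, 1, 256), parallel
```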

FFN:

The final prediction is computed by a 3-layer MLP with ReLU activations plus a linear projection layer. The MLP (FFN) predicts the normalized center coordinates, height, and width of the box with respect to the input image, and the linear layer predicts the class label via softmax. Since a fixed-size set of N boxes is predicted, where N is usually much larger than the actual number of objects in the image, an additional special class label ∅ is used to indicate that no object is detected at a given slot.
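A sketch of the two prediction heads applied to each decoder output embedding (91 classes is the paper's COCO setup; the extra logit is the special ∅ label):

```python
import torch
import torch.nn as nn

d_model, num_classes = 256, 91                   # 91 classes in the COCO setup

class_head = nn.Linear(d_model, num_classes + 1) # +1 for the "no object" label
box_head = nn.Sequential(                        # 3-layer MLP with ReLU
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, 4))

hs = torch.randn(100, 1, d_model)                # decoder output embeddings
pred_logits = class_head(hs)                     # (100, 1, 92); softmax gives classes
pred_boxes = box_head(hs).sigmoid()              # (100, 1, 4), normalized (cx, cy, w, h)
```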
Auxiliary decoding losses also help during training, especially in helping the model output the correct number of objects of each class: prediction FFNs and Hungarian losses are added after each decoder layer, with all prediction FFNs sharing parameters.


Original post: blog.csdn.net/weixin_45074568/article/details/125542403