Li Mu's paper intensive-reading notes: DETR (End-to-End Object Detection with Transformers)

Paper: End-to-End Object Detection with Transformers

Code: official code

Deformable DETR: Paper   Code

Video: Intensive Reading of the DETR Paper [Paper Intensive Reading series], bilibili

References in this article:

"The Tavern on the Mountain" blog (CSDN)

End-to-End Object Detection DETR

        DETR (DEtection TRansformer) is a paper published on arXiv in May 2020 and can fairly be called a milestone in object detection in recent years. As the title suggests, DETR has two major innovations: end-to-end detection and the introduction of the Transformer.

  Object detection has always been considerably more complicated than image classification, because both the position and the category of every object in the image must be predicted. Previous mainstream detectors are not end-to-end: whether proposal-based (the R-CNN series), anchor-based (the YOLO series), or anchor-free (locating objects by corner/center points), they all generate large numbers of prediction boxes of various sizes and rely on post-processing such as NMS (non-maximum suppression) to remove redundant bboxes (bounding boxes). Precisely because of all this manual intervention, prior knowledge (anchors), and NMS, the whole detection pipeline is complex, hard to tune, hard to optimize, and hard to deploy (not all hardware supports NMS, and common libraries do not necessarily provide the operators NMS needs). An end-to-end object detector is therefore something everyone has long dreamed of.

        DETR solves these problems very well. Using the Transformer's global modeling ability, it treats object detection as a set prediction problem, with no proposals and no anchors. Moreover, thanks to that global modeling, DETR does not output piles of redundant bounding boxes: its outputs correspond directly to the final bboxes, and no NMS post-processing is needed, which greatly simplifies training and deployment.

Summary      

DETR has two main innovations:

  • One is a new objective function that, through bipartite graph matching, forces the model to produce exactly one prediction box per object.
  • The second is the use of the Transformer encoder-decoder architecture.
    • The anchor-generation mechanism is replaced by learnable object queries. DETR combines the learned object queries with global image information, and through repeated attention operations the model can directly output the final prediction boxes.
    • Boxes are predicted in parallel. Parallel output is faster because the objects in an image have no ordering dependency among them.

        DETR's main advantage is that it is very simple, and its performance is also good: on the COCO dataset it matches the Faster R-CNN baseline in accuracy, memory, and speed. In addition, DETR can easily be extended to other tasks.

1 Introduction

end-to-end

        To put it bluntly, object detection is a set prediction problem, but existing methods attack it indirectly: proposal-based methods (Faster R-CNN, Mask R-CNN, Cascade R-CNN), anchor-based methods (YOLO, Focal Loss), and anchor-free methods (CenterNet, FCOS, which use the object's center point). These methods all generate redundant boxes and need NMS, and their performance is heavily constrained by the NMS step.

        DETR uses the Transformer's global modeling ability to treat object detection directly as a set prediction problem (given an image, predict the set of objects of interest in it), turning components that previously could not be learned (anchors, NMS) into learnable ones; the parts that rely on prior knowledge are removed, yielding a simple and effective end-to-end network. DETR therefore needs no painstaking anchor design, no NMS post-processing, far fewer hyperparameters to tune, and no complicated operators.

DETR training process

  • Use a CNN to extract image features.
  • Learn global features: the image features are flattened into a sequence and fed into the Transformer Encoder for global modeling, further learning global features through self-attention.
    • The self-attention in the Transformer Encoder lets every position (feature) in the image interact with every other position, so the model can roughly tell which region belongs to which object. This makes it possible to ensure each object gets only one prediction box, so the global feature is very helpful for removing redundant boxes.
  • Generate prediction boxes. Together with the learned object queries, the Transformer Decoder produces a set of N box predictions (by default N = 100, i.e. every image yields a fixed set of 100 prediction boxes).
  • Match the predicted boxes with the GT (ground-truth) boxes. Compute the bipartite matching loss, and compute the detection losses only on the matched boxes.
    • The bipartite matching algorithm selects the prediction box that best matches each object. For example, if the image contains two objects, only the two best-matching boxes are treated as foreground; the remaining 98 are marked as background (no object). Finally, as in previous detectors, the classification loss and regression loss are computed for those two boxes.

  At inference time the first three steps are the same. After the decoder produces the N prediction boxes, a confidence threshold is used to filter them and obtain the final predictions; for example, with a threshold of 0.7, only boxes whose confidence exceeds 0.7 are output, and the rest are treated as background.
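As a rough sketch of this filtering step (illustrative only, not DETR's official post-processing; it assumes logits of shape (N, num_classes+1) and boxes of shape (N, 4) coming out of the detection heads, with the last class index being "no object"):

import torch

def filter_predictions(logits, boxes, conf_threshold=0.7):
    # logits: (N, num_classes + 1) raw class scores; the last column is the "no object" class
    # boxes:  (N, 4) predicted boxes as normalized (cx, cy, w, h)
    probs = logits.softmax(-1)[:, :-1]   # drop the "no object" column
    scores, labels = probs.max(-1)       # best foreground class per prediction
    keep = scores > conf_threshold       # e.g. 0.7, as described above
    return boxes[keep], labels[keep], scores[keep]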

Advantages

  • Simplicity: not only is the framework simple and end-to-end, but DETR can run on any hardware that supports CNNs and Transformers.
  • Its performance on the COCO dataset is comparable to a well-tuned Faster R-CNN baseline, whether in memory, speed, or accuracy.
  • Good transferability: the DETR framework extends easily to other tasks; for example, it works very well on panoptic segmentation (just add a segmentation head).

Limitations

  • DETR works particularly well for large objects but performs poorly on small objects (see Experiment 4.1).

      The former is attributed to the Transformer's global modeling: no matter how large an object is, it can be detected, unlike anchor-based methods, which are limited by the anchor sizes when detecting large objects. The latter is because the authors use only a simple architecture, without many of the designs targeted at detection, such as multi-scale features and specialized detection heads.

  • Training is too slow.

        To get good results, the authors trained for 500 epochs on COCO, whereas typical models train for only a few dozen epochs.

Improvements

  DETR reaches only 44 AP, nearly 10 points below the SOTA models of the time, but the idea is excellent and it resolves many pain points of object detection, so its impact has still been great. And since it is itself a simple model, there is plenty of room for improvement. For example, Deformable DETR, proposed half a year later, incorporated multi-scale features, solving both the poor small-object detection and the slow training.

  Moreover, DETR is not just an object detection method but a highly extensible framework. Its design philosophy is to apply to ever more complex tasks while keeping them simple, ideally solving all of them with one framework. There is indeed a series of follow-up works built on it, such as Omni-DETR, UP-DETR, PnP-DETR, SMCA-DETR, DAB-DETR, SAM-DETR, DN-DETR, OW-DETR, OV-DETR, etc., applying DETR to many vision tasks such as object tracking, pose estimation in video, and semantic segmentation.

2. Related work

This section covers three topics:

  • Previous work on set prediction
  • How parallel decoding lets the Transformer predict in parallel
  • The current state of object detection research

        Current detectors all start detection from some initial guess: two-stage methods start from proposals, and single-stage methods start from anchors (or object center points).

        Set prediction methods also existed before, able to produce only one prediction box per object without NMS, but their performance was low, or they required a lot of manual intervention to lift performance, which made them complicated.

        Encoder-decoder structures were also used for detection in the past, but that work dates from around 2017 and used RNNs, with poor accuracy and efficiency (RNNs are autoregressive and slow).

  Compared with previous work, the main reason DETR works well is therefore the use of the Transformer. The earlier approaches above suffered because the backbone features were not good enough, so the models performed poorly and needed a lot of manual intervention. The success of DETR is ultimately the success of the Transformer.

3. DETR method

3.1 Objective function based on set prediction

bipartite graph matching

        The final output of DETR is a fixed-size set: whatever the image, there are always N outputs (N = 100 in this paper).

        Question: DETR always outputs 100 predictions, but an image may in fact contain only a few GT bounding boxes. How do we know which prediction box corresponds to which GT box?

        Solution: bipartite graph matching

        Suppose there are 3 workers and 4 tasks. Since each worker has different strengths, the time (cost) they take to complete each task differs. How should tasks be assigned to minimize the total cost? The Hungarian algorithm is a classic way to obtain the optimal solution to this assignment problem with relatively low complexity.

  The scipy library already wraps the Hungarian algorithm: feeding it the cost matrix is enough to obtain the optimal assignment. DETR's official code also calls this function for matching (from scipy.optimize import linear_sum_assignment); the input is the cost matrix and the output is the optimal assignment.
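A toy illustration of the workers/tasks example above (the numbers in the cost matrix are made up):

import numpy as np
from scipy.optimize import linear_sum_assignment

# rows = 3 workers, columns = 4 tasks; entry (i, j) = cost of worker i doing task j
cost = np.array([[4, 1, 3, 9],
                 [2, 0, 5, 8],
                 [3, 2, 2, 7]])

workers, tasks = linear_sum_assignment(cost)   # optimal assignment
print(workers, tasks)                          # [0 1 2] [1 0 2]: worker i is assigned task tasks[i]
print(cost[workers, tasks].sum())              # minimum total cost: 5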

        In fact, the problem at hand, "pick the M predicted boxes (out of N) that correspond to the GT boxes", can also be viewed as a bipartite matching problem, where the "cost" is the loss between each predicted box and each GT box.

calculate loss

        The entries of the cost matrix consist of a classification loss and a bounding box loss, that is:
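Concretely, following the paper, the matching cost between ground-truth object $y_i = (c_i, b_i)$ and the prediction at index $\sigma(i)$ is

$$\mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)}) = -\mathbb{1}_{\{c_i \neq \varnothing\}}\,\hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\,\mathcal{L}_{\text{box}}(b_i, \hat{b}_{\sigma(i)})$$

where $c_i$ is the GT class, $\hat{p}_{\sigma(i)}(c_i)$ is the predicted probability of that class, and $b_i$, $\hat{b}_{\sigma(i)}$ are the GT and predicted boxes. The Hungarian algorithm searches over assignments $\sigma$ of predictions to GT objects for the one that minimizes the total matching cost.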

        Traverse all prediction boxes and compute both losses against each GT box. Finding the optimal match is in fact similar to matching predictions to proposals or anchors in earlier detectors, but the constraint here is stronger: a strict one-to-one correspondence is required, i.e. exactly one box is forced per object, so no NMS post-processing is needed afterwards.

This procedure of finding matching plays the same role as the heuristic assignment rules used to match proposal or anchors to ground truth objects in modern detectors. The main difference is that we need to find one-to-one matching for direct set prediction without duplicates

        After determining which of the 100 generated boxes correspond to the GT boxes, the loss function is computed as in conventional object detection:
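For reference, the Hungarian loss from the paper, computed over all N slots with the optimal assignment $\hat{\sigma}$ found above, is

$$\mathcal{L}_{\text{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N}\left[-\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\,\mathcal{L}_{\text{box}}(b_i, \hat{b}_{\hat{\sigma}(i)})\right]$$

where the log-probability term for slots matched to "no object" is down-weighted (by a factor of 10 in the paper) to balance the classes.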

        For the loss function, DETR has two small changes:

        The first is removing the log from the classification term. Detectors usually use the log-probability when computing this term, but DETR uses the probability directly (in the matching cost) so that its numerical range stays close to that of the box loss and the two are easier to optimize together.

        The second is that the box regression loss is L1 loss plus GIoU loss. For box regression (the second term), the usual approach computes only an L1 loss between the predicted and ground-truth box coordinates. However, the global features extracted by the Transformer in DETR are friendlier to large objects, so large boxes are often produced, and the L1 loss of a large box can be very large, which hurts optimization. The authors therefore add a Generalized IoU loss, which is independent of box size.
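So the box loss used above is a weighted sum of the two terms (the λ's are hyperparameters, set to 2 for GIoU and 5 for L1 in the paper):

$$\mathcal{L}_{\text{box}}(b_i, \hat{b}_{\sigma(i)}) = \lambda_{\text{iou}}\,\mathcal{L}_{\text{giou}}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{L1}\,\lVert b_i - \hat{b}_{\sigma(i)} \rVert_1$$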

        So the overall steps are (a minimal code sketch follows this list):

  • Traverse all prediction boxes and GT boxes and compute their pairwise losses.
  • Arrange the losses into a cost matrix, then use scipy's linear_sum_assignment to find the optimal assignment, i.e. the prediction box that best matches each GT box.
  • Compute the loss between each optimally matched prediction box and its GT box.
  • Back-propagate the gradients.
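Below is a minimal sketch of these steps under some assumptions: it is not the official HungarianMatcher from the DETR repository, the helper names are made up, it assumes torchvision.ops.generalized_box_iou is available, and the 5/2 weights for the L1/GIoU terms follow the paper's defaults.

import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import generalized_box_iou

def cxcywh_to_xyxy(b):
    # convert (cx, cy, w, h) boxes to (x1, y1, x2, y2) for the GIoU computation
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

@torch.no_grad()
def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes, l1_w=5.0, giou_w=2.0):
    # pred_logits: (N, C+1), pred_boxes: (N, 4) in cxcywh
    # gt_labels: (M,) long tensor, gt_boxes: (M, 4) in cxcywh
    probs = pred_logits.softmax(-1)                           # (N, C+1)
    cost_class = -probs[:, gt_labels]                         # (N, M): minus probability of each GT class (no log)
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)          # (N, M): pairwise L1 distance between boxes
    cost_giou = -generalized_box_iou(cxcywh_to_xyxy(pred_boxes),
                                     cxcywh_to_xyxy(gt_boxes))  # (N, M)
    cost = cost_class + l1_w * cost_l1 + giou_w * cost_giou   # the cost matrix
    pred_idx, gt_idx = linear_sum_assignment(cost.cpu().numpy())
    return pred_idx, gt_idx   # prediction pred_idx[k] is matched to GT gt_idx[k]

The detection losses are then computed only for these matched pairs (plus the "no object" classification loss for the unmatched predictions), and the gradients are back-propagated.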

3.2 DETR model architecture

        Image input size 3×800×1066

  • The backbone uses a CNN (ResNet-50) to extract image features, producing a feature map of size 2048×25×34 (height and width reduced to 1/32); a 1×1 Conv then reduces the number of channels, giving features of size 256×25×34.
  • A positional encoding of the same size (fixed, 256×25×34) is added to the CNN features, which are then flattened to 850×256 and fed to the Transformer Encoder, which outputs features of size 850×256.
  • The resulting global image features are fed to the Transformer Decoder; the other input is the learned object queries, of size 100×256, where 256 matches the feature dimension and 100 is the number of boxes to output. During decoding, the learned object queries repeatedly perform cross-attention with the global image features, and the final feature size is 100×256.
    • The object queries here play the role of the earlier anchors/proposals: they are a hard constraint telling the model to produce exactly 100 outputs in the end.
    • The learned object queries are learnable positional embeddings, updated by gradients along with the model parameters.
  • The 100 outputs go through the detection head (the FFN, i.e. the fully connected layers commonly used in detection) and yield 100 prediction boxes (x_center, y_center, w, h) and their corresponding classes.

The final prediction is computed by a 3-layer perceptron with ReLU activation function and hidden dimension d, and a linear projection layer. The FFN predicts the normalized center coordinates, height and width of the box w.r.t. the input image, and the linear layer predicts the class label using a softmax function.

  • Bipartite graph matching selects the final prediction boxes, then the loss between the predicted and ground-truth boxes is computed and the network is updated by back-propagation.

        In addition, there are some details:

  • The Transformer encoder and decoder each have 6 layers.
  • Except in the first layer, each Transformer decoder layer first computes self-attention among the object queries, mainly to remove redundant boxes: after the queries interact, each roughly knows what kind of box the others will produce, so they do not duplicate each other (see the experiments).
  • The decoder uses auxiliary losses: each of the 6 decoder layers sends its 100×256 output through the FFNs to get predictions and compute a loss, which makes the model converge faster. The FFNs of all layers share parameters.

We add prediction FFNs and Hungarian loss after each decoder layer. All predictions FFNs share their parameters. We use an additional shared layer-norm to normalize the input to the prediction FFNs from different decoder layers.

the code

To illustrate how simple the end-to-end DETR framework is, the authors give the DETR model definition and inference code at the end of the paper, fewer than 50 lines in total. This version omits some details, but it fully conveys the workflow of DETR; trained directly, it reaches about 40 AP, roughly two AP below the DETR baseline model.

import torch
from torch import nn
from torchvision.models import resnet50

class DETR(nn.Module):
    def __init__(self, num_classes, hidden_dim, nheads,
        num_encoder_layers, num_decoder_layers):
        super().__init__()       
        self.backbone = nn.Sequential(*list(resnet50(pretrained=True).children())[:-2])  # We take only convolutional layers from ResNet-50 model
        self.conv = nn.Conv2d(2048, hidden_dim, 1)  # 1x1 conv reduces the 2048-dim backbone features to hidden_dim (256)
        self.transformer = nn.Transformer(hidden_dim, nheads, num_encoder_layers, num_decoder_layers)  # decoder output: 100×256
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)  # classification FFN (+1 for the "no object" class)
        self.linear_bbox = nn.Linear(hidden_dim, 4)                 # box regression FFN
        self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))  # learned object queries
        # the next two parameters are the learned positional encodings (rows and columns)
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, inputs):
        x = self.backbone(inputs)
        h = self.conv(x)
        H, W = h.shape[-2:]
        pos = torch.cat([self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
                         self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
                         ], dim=-1).flatten(0, 1).unsqueeze(1)  # 2D positional encoding, flattened to (H*W, 1, hidden_dim)
        h = self.transformer(pos + h.flatten(2).permute(2, 0, 1), self.query_pos.unsqueeze(1)) 
        return self.linear_class(h), self.linear_bbox(h).sigmoid()


detr = DETR(num_classes=91, hidden_dim=256, nheads=8, num_encoder_layers=6, num_decoder_layers=6)
detr.eval()
inputs = torch.randn(1, 3, 800, 1200)
logits, bboxes = detr(inputs)
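        With this listing, the decoder output h has shape (100, 1, hidden_dim), so logits should come out as (100, 1, 92) (91 COCO classes plus the "no object" class) and bboxes as (100, 1, 4); the sigmoid keeps the predicted (x_center, y_center, w, h) normalized to [0, 1].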

        Why +1 for the predicted category:

Since we predict a fixed-size set of N bounding boxes, where N is usually much larger than the actual number of objects of interest in an image, an additional special class label ∅ is used to represent that no object is detected within a slot. This class plays a similar role to the “background” class in the standard object detection approaches.

4. Experiments

performance comparison

        The following table gives the quantitative performance comparison of DETR and the baseline Faster RCNN.

        The Faster R-CNN results in the top part come from the Detectron2 implementation. Faster R-CNN is split into two groups because DETR uses several training tricks that became common only in recent years, such as GIoU loss, stronger data augmentation, and longer training schedules, so the authors retrained Faster R-CNN with these strategies for a fair comparison.

  • Training strategies introduced in recent years significantly improve object detection models.

        Comparing the first and second blocks of the table (retrained models are marked with +), a better training strategy reliably adds about two points of AP.

  • DETR is slightly more accurate than Faster R-CNN, mainly because its detection of large objects is excellent.

        The last three columns of the table report detection performance on small, medium, and large objects respectively. DETR is clearly better on large objects (about 6 AP higher), but on small objects it is far worse than Faster R-CNN (about 4 AP lower).

  • The authors attribute the former to the global modeling ability of the Transformer and the absence of preset, fixed anchors: the prediction box can be as large as needed, which is friendlier to large objects.
  • DETR performs poorly on small objects because the model in this paper is still relatively simple and lacks many detection-specific designs, such as FPN-style multi-scale features for small objects.
  • Parameter count, computation, and inference speed are not necessarily related.

        #params, GFLOPS, and FPS denote the number of model parameters, the computational cost, and the inference speed, respectively. DETR has fewer parameters and lower GFLOPS, yet its inference is slower, probably because hardware is optimized to different degrees for different architectures; at present, a CNN of the same or even larger scale still runs faster than a Transformer.

Encoder/decoder layer ablation test

The result is that more layers give better results, but considering the computational cost the authors finally chose 6 layers; in fact, 3 layers already performs about as well.

visualization

Encoder Self-Attention Map Visualization

        The figure below visualizes the Encoder attention heat maps for a set of reference points (the red dots in the figure), i.e. the self-attention values between each reference point and all other positions in the image.

        It can be observed that the Transformer Encoder can already separate the objects quite clearly; the heat maps almost look like instance segmentation masks. Even with some occlusion (the two cows on the left), it can still clearly tell which pixels belong to which cow.

        This effect comes from the global modeling ability of the Transformer Encoder: every position can perceive every other position in the image, so different objects can be told apart. On this basis, predicting only one box per object becomes much simpler and works better.

Decoder Attention Map Visualization

        The previous visualization showed that the Encoder learns a global representation that can already roughly distinguish the objects in the image. But object detection also requires precise bounding box coordinates, and that part is handled by the Decoder.

        The figure below visualizes the Decoder attention for different objects, shown in different colors (for example, the two elephants in the left image are blue and orange). The attention for each object concentrates on its boundary regions, such as the elephant's trunk, tail, and legs; for the zebras, DETR can still distinguish each animal's outline and even attend to its stripes. The authors believe the Decoder is picking out the extremities of each object's boundary: the Encoder learns a global feature that separates the objects as much as possible, and the Decoder then attends to the exact boundary locations, finally predicting each object's bounding box precisely.

object query visualization

        The figure above shows, for all images of the COCO 2017 validation set, the prediction boxes associated with each learned object query. The authors display 20 of the 100 object queries; each square corresponds to one object query, and each point is the normalized center coordinate of one predicted box. Each object query is like a person asking a particular question. Green points represent small boxes, red points large horizontal boxes, and blue points large vertical boxes.

        It can be seen that each learned object query in effect learns a particular mode of "querying" for objects. For example, the first query asks whether there is a small object in the lower-left corner of the image and whether there is a large horizontal object in the middle; the second query asks whether there is a small object on the right; and once all 100 object queries have asked their questions, detection for the image is complete. This visualization shows that the learned object queries play a role similar to anchors, checking whether a certain kind of object appears at a certain position, but anchors must be set by hand from priors, whereas the queries are learned end-to-end together with the network.

        The figure also shows a red vertical line in the middle of every square, indicating that every query checks whether there is a large horizontal object in the center of the image. This is because COCO images often have a large object in the center, and the queries have learned this pattern, or distribution, of the dataset.

5 Conclusion

  • DETR replaces hand-designed anchors with learned object queries and NMS post-processing with bipartite graph matching, turning previously non-learnable steps into learnable ones and thus realizing an end-to-end object detection network.
  • With the Transformer's global feature interaction, it can directly output one reliable, one-to-one prediction per object.
  • It is on par with the Faster R-CNN baseline on the COCO dataset and achieves better results on the panoptic segmentation task.
  • DETR works particularly well on large objects.
  • Its main advantage is simplicity, with great potential to be applied to other tasks.
  • Disadvantages: training takes a long time and, because of the Transformer, is not easy to optimize; performance on small objects is also poor. Deformable DETR later addressed both the slow training and the poor small-object detection.

        Although DETR's own detection performance is not outstanding, it solved real pain points in object detection and proposed a new end-to-end detection framework, and a series of follow-up works then pushed its performance up.

        Follow-up work: Omni-DETR, UP-DETR, PnP-DETR, SMCA-DETR, Deformable-DETR, DAB-DETR, SAM-DETR, DN-DETR, OW-DETR, OV-DETR


Original post: blog.csdn.net/iwill323/article/details/128450164