DETR - the beginning of end-to-end object detection using Transformer


Summary of deep learning knowledge points

Column link:
https://blog.csdn.net/qq_39707285/article/details/124005405

This column summarizes key knowledge points in deep learning. Starting from the major dataset competitions, it introduces the champion algorithms over the years, and also covers important topics such as loss functions, optimizers, classic algorithms, and optimization strategies like Bag of Freebies (BoF).


From RNN to Attention to Transformer series

Column link:
https://blog.csdn.net/qq_39707285/category_11814303.html

This column introduces RNN, LSTM, Attention, and the Transformer, together with their code implementations.


YOLO series target detection algorithm

Column link:
https://blog.csdn.net/qq_39707285/category_12009356.html

This column introduces the YOLO series of algorithms in detail, including the official YOLOv1, YOLOv2, YOLOv3, YOLOv4, Scaled-YOLOv4, YOLOv7, and YOLOv5, Meituan's YOLOv6, PaddlePaddle's PP-YOLO and PP-YOLOv2, as well as YOLOR, YOLOX, and YOLOS.


Visual Transformer

Column link:
https://blog.csdn.net/qq_39707285/category_12184436.html

This column introduces various Visual Transformers in detail, including algorithms applied to classification, detection, and segmentation.



DETR
The beginning of applying the Transformer to end-to-end object detection. Paper: "End-to-End Object Detection with Transformers".

1. Introduction

This paper proposes a new way of looking at object detection: as a direct set prediction problem (that is, the model directly outputs a set of predicted box coordinates and categories). This approach simplifies the detection pipeline and removes many hand-designed components such as NMS and anchors, components that explicitly encode prior knowledge about the task and need to be redesigned for different tasks.

The algorithm proposed in this paper is named DEtection TRansformer (DETR). It has two main ingredients: a set-based global loss that forces unique predictions via bipartite matching, and a Transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations between the objects and the global image context and directly outputs the final set of predictions in parallel. Unlike many other modern detectors, the algorithm is conceptually simple and does not require any specialized libraries. See the implementation code at the end of the article for details.

2. Relevant knowledge

Building this algorithm draws on several areas: bipartite matching losses for set prediction, encoder-decoder architectures based on the Transformer, parallel decoding, and object detection methods.

2.1 Set prediction

The first difficulty faced by most current algorithms is avoiding near-duplicates. Most detectors use NMS to deal with this problem, but direct set prediction has no such post-processing: it requires a global inference scheme that models interactions between all predicted elements in order to avoid redundancy. For constant-size set prediction, dense fully connected networks are sufficient but costly. A general approach is to use autoregressive sequence models such as recurrent neural networks. In all cases, the loss function should be invariant to permutations of the predictions. The usual solution is to design the loss around the Hungarian algorithm, which finds a bipartite matching between GT and predictions. This enforces permutation invariance and guarantees that each target element has a unique match. This paper adopts the bipartite matching loss but, in contrast to most previous work, no longer uses an autoregressive model and instead uses a Transformer with parallel decoding.
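
As a minimal illustration of the idea (not DETR's actual code), the sketch below uses SciPy's linear_sum_assignment, a Hungarian-style solver, on a made-up cost matrix; the resulting one-to-one assignment, and therefore any loss built on top of it, does not change when the predictions are permuted.

# Minimal sketch: optimal bipartite matching between GT and predictions.
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j] = cost of matching GT i to prediction j (values made up)
cost = np.array([
    [0.1, 0.9, 0.8],   # GT 0 is closest to prediction 0
    [0.7, 0.2, 0.6],   # GT 1 is closest to prediction 1
])

gt_idx, pred_idx = linear_sum_assignment(cost)
print(list(zip(gt_idx, pred_idx)))   # [(0, 0), (1, 1)]: unique one-to-one matches

# Permuting the predictions (columns) permutes the matched indices,
# but the total matching cost, and hence the loss, is unchanged.
perm = [2, 0, 1]
gt_idx2, pred_idx2 = linear_sum_assignment(cost[:, perm])
assert cost[gt_idx, pred_idx].sum() == cost[:, perm][gt_idx2, pred_idx2].sum()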

2.2 Transformer and parallel decoding

For an introduction to the Transformer, see the column linked above. Attention mechanisms are neural network layers that aggregate information from the entire input sequence. The Transformer introduces self-attention layers which, similarly to non-local neural networks (Non-Local NN), scan through each element of a sequence and update it by aggregating information from the whole sequence. One of the main advantages of attention-based models is their global computation and perfect memory, which makes them more suitable than RNNs for long sequences.

Transformers were first used in autoregressive models, following early sequence-to-sequence models, generating output tokens one by one. However, the resulting inference cost is high (proportional to the output length, and hard to batch). This paper therefore combines the Transformer with parallel decoding to strike an appropriate trade-off between computational cost and the ability to perform the global computation required for set prediction.

2.3 Object Detection

In our model, the hand-designed components are removed and the detection pipeline is simplified by directly predicting the set of detections with absolute box predictions with respect to the input image, rather than with respect to anchors.

  1. Set-based loss
    In early deep learning models, the relations between different predictions were modeled only with convolutional or fully connected layers, and hand-designed NMS post-processing can improve their performance. Recent detectors use non-unique assignment rules between GTs and predictions, together with NMS.
    Learnable NMS methods and relation networks explicitly model relations between different predictions with attention. Using a direct set loss, they do not require any post-processing steps. However, these methods employ additional hand-crafted context features, such as proposal box coordinates, to model relations between detections efficiently, while this paper looks for solutions that reduce the prior knowledge encoded in the model.

  2. Recurrent detectors
    The approach closest to this paper is end-to-end set prediction for object detection and instance segmentation. Similar to this paper, they use a bipartite matching loss with an encoder-decoder architecture based on CNN activations to directly generate a set of bounding boxes. However, these methods were only evaluated on small datasets and not against modern baselines. In particular, they are based on autoregressive models (more precisely RNNs), so they do not take advantage of the Transformer with parallel decoding.

3. DETR model

Two ingredients are essential for direct set prediction in detection: (1) a set prediction loss that forces unique matching between predictions and GT boxes; (2) an architecture that predicts (in a single pass) a set of objects and models their relations. The overall structure is shown below:
[Figure: overall DETR architecture]

3.1 Object detection set prediction loss

DETR infers a fixed-size set of N predictions in a single pass through the decoder, where N is set to be significantly larger than the typical number of objects in an image. One of the main difficulties of training is scoring the predicted objects (class, position, size) with respect to the GT. The loss in this paper produces an optimal bipartite matching between predicted and GT objects, and then optimizes the object-specific (bounding box) losses.

Use $y$ to denote the GT set of objects and $\hat y = \{\hat y_i\}_{i=1}^{N}$ the set of $N$ predictions. Since $N$ is larger than the number of objects in the image, $y$ is padded with $\varnothing$ (no object) so that both sets have size $N$. To find a bipartite matching between these two sets, search for the permutation of $N$ elements $\sigma \in \mathfrak{S}_N$ with the lowest matching cost:

$$\hat\sigma = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i=1}^{N} \zeta_{match}(y_i, \hat y_{\sigma(i)}) \tag{1}$$

where $\zeta_{match}(y_i,\hat y_{\sigma(i)})$ is the pairwise matching cost between the GT $y_i$ and the prediction at position $\sigma(i)$. This optimal assignment can be computed efficiently with the Hungarian algorithm.
The matching cost takes into account both the class prediction and the similarity between the predicted and GT boxes. Each element $i$ of the GT set can be regarded as $y_i = (c_i, b_i)$, where $c_i$ is the target class label (which may be $\varnothing$) and $b_i \in [0,1]^4$ is a vector defining the GT box center coordinates and its height and width relative to the image size. For the prediction at position $\sigma(i)$, define the probability of class $c_i$ as $\hat p_{\sigma(i)}(c_i)$ and the predicted box as $\hat b_{\sigma(i)}$. With this notation, $\zeta_{match}(y_i,\hat y_{\sigma(i)})$ is defined as:

$$\zeta_{match}(y_i,\hat y_{\sigma(i)}) = -\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat p_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \zeta_{box}(b_i, \hat b_{\sigma(i)})$$
This matching procedure plays the same role as the heuristic assignment rules used in modern detectors to match proposal boxes or anchors to GT objects. The main difference is that here one-to-one matches are needed for direct set prediction without duplicates.
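
A simplified sketch of this matching step is given below (it only uses the class probability and an L1 box cost; the official implementation also includes the generalized IoU term described under "Bounding box loss" below, as well as per-term weights):

# Simplified sketch of DETR-style matching: class term + L1 box cost only.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # pred_logits: (N, num_classes + 1), pred_boxes: (N, 4) normalized cxcywh
    # gt_labels:   (M,) class indices,   gt_boxes:   (M, 4) normalized cxcywh
    prob = pred_logits.softmax(-1)                      # (N, num_classes + 1)
    cost_class = -prob[:, gt_labels]                    # (N, M): -p_hat(c_i)
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # (N, M): L1 box cost
    cost = cost_class + cost_bbox                       # per-term weights omitted
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)  # one-to-one pairs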

The second step is to compute the loss function, the Hungarian loss for all pairs matched in the previous step. The loss defined in this paper is similar to the losses of common object detectors, i.e. a linear combination of a negative log-likelihood for the class prediction and a box loss:

$$\zeta_{Hungarian}(y, \hat y) = \sum_{i=1}^{N} \left[ -\log \hat p_{\hat\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \zeta_{box}(b_i, \hat b_{\hat\sigma(i)}) \right] \tag{2}$$

where $\hat\sigma$ is the optimal assignment computed in the first step (Equation 1). In practice, the log-probability term is down-weighted by a factor of 10 when $c_i = \varnothing$ to account for the class imbalance. Note that the matching cost between an object and $\varnothing$ does not depend on the prediction, which means that in this case the cost is a constant. In the matching cost, the probability $\hat p_{\sigma(i)}(c_i)$ is used instead of the log-probability; this makes the class prediction term commensurate with $\zeta_{box}(\cdot,\cdot)$.
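A simplified sketch of this Hungarian loss, reusing the matching indices from the sketch above (the box term, detailed in the next paragraph, is reduced to an L1 loss here):

# Simplified sketch of the Hungarian loss: cross-entropy over all N slots
# with the no-object class down-weighted, plus a box loss on matched pairs.
import torch
import torch.nn.functional as F

def hungarian_loss(pred_logits, pred_boxes, gt_labels, gt_boxes,
                   pred_idx, gt_idx, no_object_class, eos_coef=0.1):
    N = pred_logits.shape[0]
    # Unmatched predictions are assigned the "no object" class.
    target_classes = torch.full((N,), no_object_class, dtype=torch.long)
    target_classes[pred_idx] = gt_labels[gt_idx]
    # Down-weight the log-probability term of the no-object class (factor 10).
    weight = torch.ones(no_object_class + 1)
    weight[no_object_class] = eos_coef
    loss_ce = F.cross_entropy(pred_logits, target_classes, weight=weight)
    # Box loss only for matched (real) objects; reduced to L1 in this sketch,
    # the full version is given in the "Bounding box loss" paragraph below.
    loss_box = F.l1_loss(pred_boxes[pred_idx], gt_boxes[gt_idx], reduction="sum")
    return loss_ce + loss_box / max(len(gt_idx), 1)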

Bounding box loss: the second part of the matching cost and the Hungarian loss is $\zeta_{box}(\cdot,\cdot)$, which scores the bounding boxes. Unlike many detectors that predict boxes as offsets (∆) with respect to some initial guesses, DETR makes box predictions directly. While this approach simplifies the implementation, it raises an issue with the relative scale of the losses: the commonly used L1 loss has different scales for small and large boxes even when their relative errors are similar. To mitigate this, a linear combination of the L1 loss and the generalized IoU loss, which is scale-invariant, is used. Overall, the box loss $\zeta_{box}(b_i, \hat b_{\sigma(i)})$ is defined as:

$$\zeta_{box}(b_i, \hat b_{\sigma(i)}) = \lambda_{iou}\, \zeta_{iou}(b_i, \hat b_{\sigma(i)}) + \lambda_{L1}\, \lVert b_i - \hat b_{\sigma(i)} \rVert_1$$

where $\lambda_{iou}, \lambda_{L1} \in \mathbb{R}$ are hyperparameters. These two losses are normalized by the number of objects in the batch.
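A hedged sketch of this box loss, assuming torchvision's box_convert and generalized_box_iou utilities and matched boxes in normalized (cx, cy, w, h) format; the default λ values below are only example settings:

# Sketch of the box loss: L1 + generalized IoU on matched box pairs.
# pred_boxes, gt_boxes: (num_matched, 4) in normalized cxcywh format.
import torch
from torchvision.ops import box_convert, generalized_box_iou

def box_loss(pred_boxes, gt_boxes, lambda_iou=2.0, lambda_l1=5.0):
    num_boxes = max(pred_boxes.shape[0], 1)
    loss_l1 = torch.abs(pred_boxes - gt_boxes).sum() / num_boxes
    giou = generalized_box_iou(
        box_convert(pred_boxes, in_fmt="cxcywh", out_fmt="xyxy"),
        box_convert(gt_boxes, in_fmt="cxcywh", out_fmt="xyxy"),
    )                                   # pairwise matrix; matched pairs on the diagonal
    loss_giou = (1.0 - giou.diag()).sum() / num_boxes
    return lambda_iou * loss_giou + lambda_l1 * loss_l1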

3.2 DETR structure

3.2.1 backbone

The input image $x_{img} \in \mathbb{R}^{3 \times H_0 \times W_0}$ is passed through a convolutional backbone to generate a lower-resolution feature map $f \in \mathbb{R}^{C \times H \times W}$, with $C$ set to 2048 and $H, W = \frac{H_0}{32}, \frac{W_0}{32}$.

3.2.2 Transformer encoder

First, a 1×1 convolution reduces the channel dimension of the feature map $f$ from $C$ to a smaller dimension $d$, producing a new feature map $z_0 \in \mathbb{R}^{d \times H \times W}$. The encoder expects a sequence as input, so the spatial dimensions of $z_0$ are collapsed into one dimension, resulting in a $d \times HW$ feature map. Each encoder layer has the standard architecture, consisting of a multi-head self-attention module and a feed-forward network (FFN). Since the Transformer architecture is permutation-invariant, it is supplemented with a fixed positional encoding that is added to the input of each attention layer.
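
A rough sketch of this step, using PyTorch's built-in nn.TransformerEncoder and a random stand-in for the positional encoding (in DETR the fixed sine encoding is added at every attention layer, not only at the input):

# Sketch: 1x1 conv to d channels, flatten HxW into a sequence, add a
# positional encoding, and run the Transformer encoder.
import torch
from torch import nn

d, C, H, W = 256, 2048, 25, 34
f = torch.randn(1, C, H, W)                    # backbone feature map

proj = nn.Conv2d(C, d, kernel_size=1)          # C -> d
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8), num_layers=6)

z0 = proj(f)                                   # (1, d, H, W)
src = z0.flatten(2).permute(2, 0, 1)           # (H*W, 1, d) sequence
pos = torch.randn(H * W, 1, d)                 # stand-in positional encoding
memory = encoder(src + pos)                    # (H*W, 1, d) encoder output
print(memory.shape)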

3.2.3 Transformer decoder

The decoder follows the standard architecture of the Transformer, using multi-head self-attention and encoder-decoder attention mechanisms to transform N embeddings of size d. The difference with the original Transformer is that each decoder layer here decodes the N targets in parallel, whereas Vaswani et al. use an autoregressive model that predicts the output sequence one element at a time. Since the decoder is also permutation-invariant, the N input embeddings must be different in order to produce different results. These input embeddings are learned positional encodings, called object queries, which, as in the encoder, are added to the input of each attention layer. The decoder converts the N object queries into output embeddings, which are then independently decoded into box coordinates and class labels by a feed-forward network (described in the next subsection), resulting in N final predictions. Through self-attention and encoder-decoder attention over these embeddings, the model uses the pairwise relations between all objects to reason about them globally, while being able to use the whole image as context.
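
A rough sketch of the parallel decoding with learned object queries, again using PyTorch's built-in nn.TransformerDecoder (simplified: the queries are fed once as the target sequence instead of being re-added at every attention layer as in the paper):

# Sketch: N learned object queries are decoded in parallel against the
# encoder memory; each query produces one output embedding.
import torch
from torch import nn

d, num_queries, seq_len = 256, 100, 850
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d, nhead=8), num_layers=6)

object_queries = nn.Parameter(torch.rand(num_queries, 1, d))   # learned embeddings
memory = torch.randn(seq_len, 1, d)                            # encoder output

out = decoder(object_queries, memory)   # (num_queries, 1, d): no causal mask,
print(out.shape)                        # all N predictions are produced at once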

3.2.4 Prediction feed-forward networks (FFNs)

The final prediction is computed by a 3-layer perceptron with ReLU activations and hidden dimension d, together with a linear projection layer. The FFN predicts the normalized center coordinates, height, and width of the box with respect to the input image, and the linear layer predicts the class label via a softmax. Since a fixed-size set of N bounding boxes is predicted, where N is usually much larger than the actual number of objects in the image, an additional special class label ⊘ is used to indicate that no object is detected. This class plays a role similar to the "background" class in standard object detection methods.
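
A minimal sketch of these two prediction heads (shapes and dimensions are only example values):

# Sketch of the prediction heads: a 3-layer MLP with ReLU for the boxes and
# a linear projection for the class logits, applied to every output embedding.
import torch
from torch import nn

d, num_classes, num_queries = 256, 91, 100

class_head = nn.Linear(d, num_classes + 1)     # +1 for the "no object" class
box_head = nn.Sequential(
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, 4), nn.Sigmoid(),             # normalized center, height, width
)

h = torch.randn(num_queries, 1, d)             # decoder output embeddings
logits = class_head(h)                         # (100, 1, num_classes + 1)
boxes = box_head(h)                            # (100, 1, 4)
print(logits.shape, boxes.shape)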

3.2.5 Auxiliary decoding loss

This paper finds it helpful to use auxiliary losses in the decoder during training, in particular to help the model output the correct number of objects of each class. Prediction FFNs and the Hungarian loss are added after each decoder layer, with all prediction FFNs sharing their parameters. An additional shared layer-norm is used to normalize the inputs to the prediction FFNs coming from the different decoder layers.
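
A rough sketch of how these auxiliary losses could be wired up, assuming the decoder returns the output of every layer; per_layer_loss is a hypothetical stand-in for the matching plus Hungarian loss of Section 3.1:

# Sketch: shared prediction heads and a shared LayerNorm are applied to the
# output of every decoder layer, and a Hungarian loss is summed over layers.
import torch
from torch import nn

def per_layer_loss(logits, boxes):
    # Hypothetical stand-in for matching + Hungarian loss (Section 3.1).
    return logits.sum() * 0.0 + boxes.sum() * 0.0

d, num_layers, num_queries = 256, 6, 100
norm = nn.LayerNorm(d)                  # shared across decoder layers
class_head = nn.Linear(d, 92)           # shared class head
box_head = nn.Linear(d, 4)              # shared box head (a 3-layer MLP in DETR)

# Stand-in for the intermediate outputs of the 6 decoder layers.
intermediate = [torch.randn(num_queries, 1, d) for _ in range(num_layers)]

total_loss = sum(
    per_layer_loss(class_head(norm(h)), box_head(norm(h)).sigmoid())
    for h in intermediate)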

3.3 Detailed structure of DETR

This subsection describes in more detail the Transformer used in DETR and the positional encodings passed at each attention layer. Image features from the CNN backbone are passed through the Transformer encoder, together with a spatial positional encoding that is added to the queries and keys of every multi-head self-attention layer. The decoder then receives queries (initially set to zero), output positional encodings (object queries), and the encoder memory, and produces the final set of predicted class labels and bounding boxes through multiple layers of multi-head self-attention and decoder-encoder attention. The first self-attention layer in the first decoder layer can be skipped.

4. Conclusion

This paper proposes DETR, a new design for object detection systems based on the Transformer and a bipartite matching loss for direct set prediction. The method achieves results comparable to an optimized Faster R-CNN baseline on the challenging COCO dataset. DETR is easy to implement and has a flexible architecture that can be easily extended to panoptic segmentation with competitive results. In addition, its performance on large objects is significantly better than Faster R-CNN, likely thanks to the global information processed by self-attention.
This newly designed detector also poses new challenges, especially in terms of training, optimization, and performance on small objects.

5. Implementation code

import torch
from torch import nn
from torchvision.models import resnet50


class DETR(nn.Module):
    def __init__(self, num_classes, hidden_dim, nheads, num_encoder_layers, num_decoder_layers):
        super().__init__()
        # We take only the convolutional layers from the ResNet-50 model
        self.backbone = nn.Sequential(*list(resnet50(pretrained=False).children())[:-2])
        # 1x1 convolution: reduce the 2048 backbone channels to hidden_dim
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        # standard PyTorch Transformer (encoder + decoder)
        self.transformer = nn.Transformer(hidden_dim, nheads, num_encoder_layers, num_decoder_layers)
        # prediction heads: class logits (+1 for "no object") and box coordinates
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)
        # 100 learned object queries and learned row/column positional embeddings
        self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, inputs):
        x = self.backbone(inputs)      # (1, 2048, H, W) feature map
        h = self.conv(x)               # (1, hidden_dim, H, W)
        H, W = h.shape[-2:]
        # build an (H*W, 1, hidden_dim) positional encoding from the row/col embeddings
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)
        # flatten the spatial dimensions into a sequence and run the encoder-decoder
        h = self.transformer(pos + h.flatten(2).permute(2, 0, 1), self.query_pos.unsqueeze(1))
        # per-query class logits and normalized box coordinates
        return self.linear_class(h), self.linear_bbox(h).sigmoid()


if __name__ == "__main__":
    detr = DETR(num_classes=91, hidden_dim=256, nheads=8, num_encoder_layers=6, num_decoder_layers=6)
    detr.eval()
    inputs = torch.randn(1, 3, 800, 1200)
    logits, bboxes = detr(inputs)
    print(logits.shape)
    print(bboxes.shape)
