YOLO series object detection algorithms - YOLOS



This article summarizes:

  1. Can a Transformer perform 2D object-level recognition from a pure sequence-to-sequence perspective, i.e., transfer directly from image recognition to object detection, with minimal knowledge of the 2D spatial structure? YOLOS was proposed to answer this question;
  2. YOLOS is not designed to be yet another high-performance object detector; rather, it reveals the versatility and transferability of the Transformer from image recognition to object detection;
  3. We use the mid-sized ImageNet-1k as the only pre-training dataset and show that a vanilla ViT (DeiT) can be successfully transferred to object detection and produce competitive COCO results with as few modifications as possible, i.e., by only looking at one sequence (YOLOS);
  4. We demonstrate that 2D object detection can be done in a pure seq-to-seq fashion by taking a fixed-size sequence of non-overlapping image patches as input. Among existing object detectors, YOLOS injects the least inductive bias. Furthermore, it is feasible for YOLOS to perform object detection in any dimensional space where the precise spatial structure or geometry is unknown;
  5. For ViT (DeiT), the object detection results are found to be quite sensitive to the pre-training scheme, and the detection performance is far from saturated. Therefore, YOLOS can serve as a challenging benchmark task for evaluating different pre-training strategies for ViT (DeiT).

Summary of deep learning knowledge points

Column link:
https://blog.csdn.net/qq_39707285/article/details/124005405

This column mainly summarizes key knowledge points in deep learning. Starting from the major dataset competitions, it introduces the champion algorithms of each year; it also covers important topics in deep learning, including loss functions, optimizers, various classic algorithms, and optimization strategies such as Bag of Freebies (BoF).



YOLO series object detection algorithms - YOLOS

2021.6.1 YOLOS: "You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection"

For background on Transformers, refer to the article "From RNN to Attention to Transformer Series - Transformer Introduction and Code Implementation", or the series of articles "From RNN to Attention to Transformer".

1 Introduction

  Transformers are now widely used, and the recently proposed ViT brings the Transformer to the image domain. However, follow-up architectures built on ViT are performance-oriented and cannot reflect the properties inherited directly from the plain, or vanilla, Vision Transformer.

  Intuitively, ViT is designed to model long-range dependencies and global contextual information rather than local and region-level relations. Furthermore, ViT lacks a hierarchical architecture like CNNs to handle large variations in the scale of visual entities. Based on the available evidence, it is unclear whether a pure ViT can transfer pre-trained general visual representations from image-level recognition to the more complex 2D object detection task.

  To answer this question, this paper proposes "You Only Look at One Sequence (YOLOS)", a family of object detection models based on the canonical ViT architecture with minimal modifications and inductive biases. The changes from ViT to the YOLOS detector are simple:

  1. YOLOS drops the [CLS] token in ViT and appends 100 learnable [DET] tokens to the input sequence for object detection;
  2. YOLOS replaces the image classification loss with a bipartite matching loss to perform object detection as set prediction, which avoids reinterpreting the ViT output sequence as a 2D feature map and prevents manually injecting heuristics and prior knowledge of the objects' 2D spatial structure during label assignment.

  YOLOS is directly inherited from ViT. It is not designed as another high-performance target detector, but reveals the versatility and transferability of Transformer from image recognition to target detection.

2. YOLOS network structure

  In its model design, YOLOS follows the original ViT architecture as closely as possible while being adapted for object detection. YOLOS can be easily applied to various Transformer variants in NLP and computer vision. This intentionally simple setting is not designed for better detection performance, but to reveal the characteristics of the Transformer family in object detection as accurately as possible.

  The overall network structure is shown in Figure 1. The changes from ViT to the YOLOS detector are very simple, as follows: "Pat-Tok" in the figure refers to a patch token, i.e., the embedding of a flattened image patch; "Det-Tok" refers to a [DET] token, a learnable embedding vector that serves as a proxy for an object detection prediction; "PE" refers to positional encoding. During training, YOLOS computes an optimal bipartite matching between the predictions generated by the 100 [DET] tokens and the ground truth; at inference time, YOLOS directly outputs the final predictions in parallel.
[Figure 1: Overall YOLOS architecture]

  1. YOLOS drops the [CLS] token used for image classification and appends 100 randomly initialized, learnable [DET] tokens to the sequence of input patch embeddings for object detection (a minimal sketch of this input construction follows this list);
  2. During training, YOLOS replaces ViT's image classification loss with a bipartite matching loss and performs object detection following the set-prediction approach of "End-to-end object detection with transformers" by Carion et al.
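
The sketch below is a minimal, illustrative PyTorch example (not the official implementation) of how such an input sequence can be built: patch embeddings from a 16×16 strided convolution, 100 randomly initialized learnable [DET] tokens appended to them, and positional embeddings added to the whole sequence. Names such as YOLOSInput, embed_dim, and num_det are assumptions chosen for illustration.

```python
# A minimal sketch (not the official implementation) of the YOLOS input sequence:
# patch embeddings plus 100 learnable [DET] tokens, with positional embeddings.
import torch
import torch.nn as nn

class YOLOSInput(nn.Module):
    def __init__(self, img_size=224, patch_size=16, embed_dim=768, num_det=100):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding: a strided convolution that flattens 16x16 patches.
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        # 100 randomly initialized, learnable [DET] tokens (the [CLS] token is dropped).
        self.det_tokens = nn.Parameter(torch.zeros(1, num_det, embed_dim))
        # Positional embeddings covering both patch tokens and [DET] tokens.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + num_det, embed_dim))

    def forward(self, x):                                     # x: (B, 3, H, W)
        patches = self.proj(x).flatten(2).transpose(1, 2)     # (B, N, D)
        det = self.det_tokens.expand(x.shape[0], -1, -1)      # (B, 100, D)
        seq = torch.cat([patches, det], dim=1)                # one sequence per image
        return seq + self.pos_embed                           # fed to a standard ViT encoder
```

The concatenated sequence is then processed by a standard ViT encoder, and only the outputs at the [DET] positions are fed to the classification and box-regression heads.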

Detection Token
  YOLOS purposefully uses randomly initialized [DET] tokens as proxies for object representations, to avoid inductive biases about the 2D structure and prior knowledge about the task being injected during label assignment. When fine-tuning on COCO, for each forward pass an optimal bipartite matching is established between the predictions generated by the [DET] tokens and the ground-truth objects. This procedure plays the same role as label assignment, but it is unaware of the 2D structure of the input, i.e., YOLOS does not need to reinterpret the ViT output sequence as a 2D feature map for label assignment. In theory, it is feasible for YOLOS to perform object detection in any dimension without knowing the exact spatial structure and geometry, as long as the input of each pass is always flattened into a sequence in the same way. A simplified sketch of such matching-based label assignment is given below.
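
The following is a rough illustration of label assignment via optimal bipartite matching, using SciPy's Hungarian solver in the spirit of DETR's matcher; the cost terms and their weights here are simplified assumptions, not the exact formulation used by YOLOS/DETR.

```python
# A simplified sketch of label assignment via optimal bipartite matching.
# The classification and box costs and the 5.0 weight are illustrative only.
import torch
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """pred_logits: (100, C), pred_boxes: (100, 4), gt_labels: (M,), gt_boxes: (M, 4)."""
    prob = pred_logits.softmax(-1)                        # (100, C)
    cost_cls = -prob[:, gt_labels]                        # (100, M) classification cost
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)     # (100, M) L1 box cost
    cost = cost_cls + 5.0 * cost_box                      # weighted total cost
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    # Each matched [DET] token is supervised by its assigned ground-truth object;
    # unmatched tokens are assigned the "no object" (background) class.
    return pred_idx, gt_idx
```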

Fine-tuning at higher resolutions
  When fine-tuning on COCO, all parameters are initialized from ImageNet-1k pre-trained weights, except the MLP heads for classification and bounding-box regression and the 100 [DET] tokens, which are randomly initialized. Both the classification head and the bounding-box regression head are implemented as an MLP with two hidden layers, with separate parameters. During fine-tuning, the image resolution is much higher than during pre-training, while the patch size is kept the same (16×16), resulting in a larger effective sequence length. ViT can handle arbitrary sequence lengths, but the positional embeddings need to be adapted to the longer input sequence; this is done by 2D interpolation of the pre-trained positional embeddings, as sketched below.
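
Below is a rough sketch of this 2D interpolation, assuming the patch part of the positional embedding is stored as a flat (1, N, D) tensor over a square grid; the [DET]-token embeddings are left unchanged. The grid sizes in the usage example are illustrative.

```python
# A rough sketch of adapting pre-trained positional embeddings to a higher
# fine-tuning resolution via 2D interpolation (patch part only).
import torch
import torch.nn.functional as F

def interpolate_pos_embed(patch_pos_embed, old_grid, new_grid):
    """patch_pos_embed: (1, old_grid*old_grid, D) -> (1, new_grid*new_grid, D)."""
    d = patch_pos_embed.shape[-1]
    pe = patch_pos_embed.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)  # (1, D, H, W)
    pe = F.interpolate(pe, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    return pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)

# Example: a 224/16 = 14x14 grid at pre-training, enlarged to a 50x50 grid for fine-tuning.
pos = torch.randn(1, 14 * 14, 768)
pos_hi = interpolate_pos_embed(pos, old_grid=14, new_grid=50)
print(pos_hi.shape)  # torch.Size([1, 2500, 768])
```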

Inductive Bias

  Inductive bias refers to the set of assumptions that lets an algorithm prioritize one solution over another, independently of the observed data. Common inductive biases include the prior distribution in Bayesian methods, regularization terms that penalize the model, and specially designed network structures. A good inductive bias improves the efficiency with which the algorithm searches for solutions (without degrading performance), while a bad one traps the algorithm in a suboptimal solution because it imposes overly strong restrictions. Fully connected layers assume that any two units may be connected; convolutions assume that data features are local and translation-invariant; recurrent neural networks assume sequential correlation and time-translation invariance; graph neural networks assume that feature aggregation is consistent across nodes. In short, the structure of a network itself encodes the assumptions and preferences of its designer, and that is the inductive bias. This introduction of inductive bias is quoted from url.

  YOLOS is designed to add as little extra inductive bias as possible. The inherent inductive biases of ViT come from the patch extraction at the network stem and from the resolution adjustment of the positional embeddings. Beyond that, YOLOS adds no non-degenerate (e.g., 3×3 or other non-1×1) convolutions on top of ViT. From a representation learning perspective, the [DET] tokens are chosen as proxies for the final predictions to avoid additional 2D inductive biases and heuristics. Performance-oriented designs inspired by CNN architectures, such as pyramidal feature hierarchies, 2D local spatial attention, and region-wise pooling operations, are not applied. All of these efforts aim to accurately reveal the versatility and transferability of Transformers from image recognition to object detection in a pure sequence-to-sequence fashion, with minimal knowledge of the spatial structure and geometry of the input.

Comparison with DETR
  The design of YOLOS is inspired by DETR: YOLOS uses [DET] tokens as proxies for object representations to avoid inductive biases about 2D structures and prior knowledge about the task being injected during label assignment, and YOLOS is optimized in a similar way to DETR. At the same time, there are some key differences between the two:

  1. DETR adopts a randomly initialized Transformer encoder-decoder architecture, while YOLOS studies the transferability of a pre-trained, encoder-only ViT;
  2. DETR applies decoder-encoder attention (cross-attention) between the encoded image features and the object queries, with deeply supervised auxiliary decoding losses at each decoder layer, while YOLOS always looks at only one sequence at every layer and does not distinguish between patch tokens and [DET] tokens in terms of operations.

3. Code

  Updating...

4 Conclusion

  In this paper, we explore the transferability of a vanilla ViT, pre-trained on the mid-sized ImageNet-1k dataset, to the more challenging COCO object detection benchmark. We demonstrate that 2D object detection can be done in a pure seq-to-seq fashion with minimal additional inductive bias, and the performance on COCO is reasonable. These preliminary results are meaningful, indicating the transferability and generality of the Transformer to various downstream tasks.
