Improving YOLO | Maybe this is the right way to combine YOLO with a Transformer?

Click on "Computer Vision Workshop" above and select "Star"

Dry goods delivered as soon as possible

3e5b16b76cba04847802c566d0c484b8.png

Author: ChaucerG

Source: Jizhi Shutong


The limitation of current state-of-the-art One-Stage object detectors is that they process each image region separately, without considering the possible relationships between objects. As a result, the model relies solely on high-quality local convolutional features to detect objects, and such features may not always be available under challenging conditions.

This paper analyzes the use of reasoning features in One-Stage object detection. The authors experiment with different architectures that leverage self-attention to model the relationships between image regions. The YOLOv3-Reasoner2 model spatially and semantically enhances features in the reasoning layer and fuses them with the original convolutional features to improve performance, achieving an improvement of about 2.5% on COCO over the YOLOv3 baseline.

1 Introduction

The goal of object detection is to classify and locate objects of interest in a given image. It has attracted wide attention due to its close connection with other computer vision applications. Before the major breakthroughs in deep learning, many traditional methods were proposed for object detection. These methods were built on hand-crafted feature representations, and this unavoidable reliance on hand-crafted features limits their performance.

The huge influence of AlexNet gave object detection a new look, and deep-learning-based methods have since dominated object detection research. Detectors based on deep learning can be divided into Two-Stage and One-Stage object detectors. The inference speed of Two-Stage detectors is low because intermediate layers are used to propose possible object regions: a region proposal layer extracts candidate regions in the first stage, and in the second stage these proposed regions are used for classification and bounding box regression. One-Stage detectors, on the other hand, predict all bounding boxes and class probabilities in a single inference pass at high speed, which makes them more suitable for real-time applications.

Recent One-Stage object detectors have achieved good performance on datasets such as MS COCO and PASCAL VOC. However, they lack the ability to consider possible relationships between image regions: each image region is processed individually, and because the receptive fields are small relative to the image size, a prediction is unaware of other image regions. The detector therefore relies entirely on high-quality local convolutional features. This is not how the human visual system works; humans use reasoning, drawing on acquired knowledge, to perform visual tasks. Many methods have been proposed to mimic this reasoning ability in object detection, but most of them are complex and adopt Two-Stage architectures, so they are not suitable for real-time applications.

In this paper, a new method to incorporate visual reasoning into One-Stage object detection is proposed. The Multi-Head Attention based reasoning layer is integrated on top of the Neck instead of the Backbone. In this way, reasoning information about the relationships between different image regions can be extracted from more meaningful, fine-grained, and enhanced feature maps.

The contributions of this paper can be summarized as follows:

  • One-Stage object detection is improved with visual reasoning: a novel architecture is proposed that extracts semantic relations between image regions to predict bounding boxes and class probabilities.

  • The impact of using only reasoning features on object detection performance is analyzed. It is demonstrated that a detector combining convolutional and reasoning features can still run in real time while achieving better performance than the baseline model.

  • The effect of leveraging reasoning on the average precision improvement for each object category is analyzed.

2 Method

The overall structure of the proposed method is shown in Figure 1. First, Darknet-53 is used for feature extraction, producing bounding box predictions at 3 different scales, as in YOLOv3. The FPN then performs the necessary upsampling operations. Next, the semantic relationships between image regions are extracted in the reasoning layer. Finally, the YOLO head predicts class probabilities and bounding boxes.

Figure 1 YOLO-Reasoning overall architecture

2.1 Reasoning Layer

A Transformer-encoder-like model is used as the reasoning layer. The architecture of the reasoning layer is shown in Figure 2.

Figure 2 Reasoning layer

1、Flatten

The Multi-Head Attention layer expects a sequence as input. In the Flatten sublayer, the tensors are reshaped into a sequence and passed to the Multi-Head Attention layer in this form.
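As a concrete illustration, here is a minimal PyTorch sketch of this reshaping step (the tensor sizes are illustrative assumptions, not values from the paper):

```python
import torch

# Hypothetical FPN output for one detection scale: (batch, channels, height, width)
features = torch.randn(2, 64, 13, 13)

B, C, H, W = features.shape
# Flatten the spatial grid into a sequence of H*W tokens of dimension C,
# the layout the Multi-Head Attention layer expects.
sequence = features.flatten(2).permute(0, 2, 1)  # (B, H*W, C) -> here (2, 169, 64)
```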

2、Positional Encoding

By its very nature, the Multi-Head Attention layer is order-agnostic. However, regional location information is valuable. To model the order of image regions, a fixed sinusoidal position encoding is employed:

$$PE_{(i,\,2j)} = \sin\!\left(\frac{i}{10000^{2j/d}}\right), \qquad PE_{(i,\,2j+1)} = \cos\!\left(\frac{i}{10000^{2j/d}}\right)$$

In the formula, i is the position of the grid region in the sequence, j indexes the feature depth, and d is the embedding size. The values generated by the sine and cosine functions are interleaved along the depth dimension and added to the convolutional feature embeddings of the grid regions.
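A hedged sketch of such a fixed sinusoidal encoding in PyTorch, assuming the embedding size d equals the channel dimension of the flattened sequence (all sizes are illustrative):

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d: int) -> torch.Tensor:
    """Fixed sinusoidal encoding: sin on even depth indices, cos on odd ones."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # i, shape (seq_len, 1)
    depth = torch.arange(0, d, 2, dtype=torch.float32)                  # 2j, shape (d/2,)
    div = torch.pow(10000.0, depth / d)
    pe = torch.zeros(seq_len, d)
    pe[:, 0::2] = torch.sin(position / div)
    pe[:, 1::2] = torch.cos(position / div)
    return pe                                                           # (seq_len, d)

# Added (not concatenated) to the flattened convolutional embeddings.
sequence = torch.randn(2, 169, 64)                  # (B, H*W, C) from the Flatten step
sequence = sequence + sinusoidal_positional_encoding(169, 64)
```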

3、Multi-Head Attention

Multi-Head Attention is the main sublayer, where reasoning between grid cells, i.e. image regions, takes place. The reasoning between different regions of the input sequence is modeled by self-attention, and the self-attention mechanism is based on the three main concepts of query, key, and value. At a high level of abstraction, a single grid cell in the query sequence searches for potential relationships and tries to relate itself to the other cells in the sequence, i.e. the other image regions, through the keys. Comparing a query with a key gives the attention weight for the corresponding value. The interaction between attention weights and values determines how much focus is placed on other parts of the sequence when representing the current cell.

In the self-attention process, the query, key, and value matrices are computed by multiplying the input sequence X by three different weight matrices, $W^Q$, $W^K$, and $W^V$:

$$Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V$$

To compare the query and key matrices, scaled dot-product attention is used:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Each grid cell, i.e. each image region, is encoded as a weighted sum over the value matrix, with the weights taken from the corresponding row of the attention matrix. In other words, when encoding the current cell, the attention weights tell which parts of the image are valuable, informative, and relevant.
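The two formulas above can be made concrete with a minimal sketch of single-head scaled dot-product attention (shapes and weight initialization are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """Q = X W_q, K = X W_k, V = X W_v; output = softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.size(-1)
    weights = F.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)  # attention weights
    return weights @ V, weights

# Each row of `weights` says how strongly one grid cell attends to every other
# cell (image region) when that cell is being encoded.
X = torch.randn(169, 64)                           # flattened grid cells
W_q, W_k, W_v = (torch.randn(64, 64) for _ in range(3))
out, weights = scaled_dot_product_attention(X, W_q, W_k, W_v)
```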

The self-attention mechanism is further improved by adopting a multi-head approach. In Multi-Head Attention, self-attention is computed in parallel across several heads. The main advantage of multiple heads over a single head is that it enables the model to work on different relational subspaces. Each head has its own query, key, and value matrices, since these are obtained with separate, randomly initialized weight matrices. The attention in head i is calculated as:

$$\mathrm{head}_i = \mathrm{Attention}(XW_i^Q,\; XW_i^K,\; XW_i^V)$$

The head outputs are then concatenated and transformed using the weight matrix $W^O$:

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O$$
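A hedged PyTorch sketch of the multi-head computation described above, with per-head projections, concatenation, and the final $W^O$ projection (the head count and dimensions are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int = 64, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        # Separate, randomly initialized projections for queries, keys, and values.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)   # W^O

    def forward(self, x):                                    # x: (B, N, d_model)
        B, N, _ = x.shape
        def split(t):                                        # (B, N, d) -> (B, heads, N, d_head)
            return t.view(B, N, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v                                     # per-head attention outputs
        heads = heads.transpose(1, 2).reshape(B, N, -1)      # Concat(head_1, ..., head_h)
        return self.w_o(heads)

out = MultiHeadSelfAttention()(torch.randn(2, 169, 64))      # (2, 169, 64)
```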

4、Skip Connections

There are 2 skip connections in the reasoning layer. As described in the ResNet paper, residual skip connections improve backpropagation and carry the original information forward to the following layers.

5、Normalization

Normalization is applied in 2 places in the reasoning layer. Together with the residual skip connections, normalization is another key factor in improving backpropagation. To deal with internal covariate shift, the authors use layer normalization.

6、MLP

The output of Multi-Head Attention is normalized and fed into a multilayer perceptron (MLP). The MLP layer consists of 2 linear layers and an intermediate ReLU nonlinear layer:

$$\mathrm{MLP}(x) = W_2\,\mathrm{ReLU}(W_1 x + b_1) + b_2$$
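As a small illustration, the MLP sublayer could look like the following (the hidden width is an assumption; the paper's exact value is not given here):

```python
import torch.nn as nn

d_model, d_hidden = 64, 256            # hidden width chosen for illustration only
mlp = nn.Sequential(
    nn.Linear(d_model, d_hidden),      # W1 x + b1
    nn.ReLU(),                         # intermediate nonlinearity
    nn.Linear(d_hidden, d_model),      # W2 (.) + b2
)
```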

7、Rearrange

Rearrange is the last sublayer of the reasoning layer, where the sequence is transformed back into the shape expected by the detection Head.
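Putting the sublayers together, here is a hedged sketch of one reasoning layer as a Transformer-encoder-style block, covering Flatten, positional encoding, Multi-Head Attention with the two skip connections and layer normalizations, the MLP, and Rearrange (hyperparameters and module choices such as nn.MultiheadAttention are assumptions, not the authors' exact implementation):

```python
import torch
import torch.nn as nn

class ReasoningLayer(nn.Module):
    """Flatten -> positional encoding -> MHA + skip + norm -> MLP + skip + norm -> rearrange."""
    def __init__(self, channels: int = 64, num_heads: int = 8, mlp_hidden: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(nn.Linear(channels, mlp_hidden), nn.ReLU(),
                                 nn.Linear(mlp_hidden, channels))

    def forward(self, fmap, pos_enc):                        # fmap: (B, C, H, W)
        B, C, H, W = fmap.shape
        x = fmap.flatten(2).permute(0, 2, 1) + pos_enc       # Flatten + positional encoding
        attn_out, _ = self.attn(x, x, x)                     # self-attention over grid cells
        x = self.norm1(x + attn_out)                         # first skip connection + layer norm
        x = self.norm2(x + self.mlp(x))                      # second skip connection + layer norm
        return x.permute(0, 2, 1).reshape(B, C, H, W)        # Rearrange back for the YOLO head

layer = ReasoningLayer()
fmap = torch.randn(2, 64, 13, 13)
pos = torch.zeros(13 * 13, 64)                               # e.g. the sinusoidal encoding above
out = layer(fmap, pos)                                       # (2, 64, 13, 13)
```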

2.2 Reasoner configuration

1、YOLOv3-Reasoner1

In this configuration, the FPN output is fed directly into the reasoning layer. Downsampling scales of 16, 8, and 4 are chosen for the three heads respectively, making the embedding size of each head 64. The output of the reasoning layer is fed into a 1×1 convolutional layer. The overall architecture of YOLOv3-Reasoner1 is shown in Figure 4.

Figure 4 YOLOv3-Reasoner1
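A minimal sketch of the Reasoner1 wiring for one detection scale (the reasoning layer is represented by a stand-in module and the channel sizes are illustrative):

```python
import torch
import torch.nn as nn

channels = 64
reasoning_layer = nn.Identity()                  # stand-in for the reasoning layer of Section 2.1
conv_1x1 = nn.Conv2d(channels, channels, kernel_size=1)

# Reasoner1: the FPN output feeds the reasoning layer directly, and only the
# reasoning features (after a 1x1 conv) reach the YOLO head.
fpn_out = torch.randn(2, channels, 13, 13)
head_in = conv_1x1(reasoning_layer(fpn_out))
```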

2、YOLOv3-Reasoner2

In this configuration, the output of the reasoning layer is concatenated with the FPN output through a shortcut. The concatenated output is then fed into a 1×1 convolutional layer that fuses the reasoning features with the original convolutional features. Since some parts of the convolutional features may be attenuated in the reasoning layer, this concatenation strategy ensures that the original convolutional features remain reusable. The architecture of YOLOv3-Reasoner2 is shown in Figure 5.

Figure 5 YOLOv3-Reasoner2
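For comparison, a sketch of the Reasoner2 wiring, where the reasoning output is concatenated with the FPN output before the 1×1 fusion convolution (again with a stand-in reasoning module and illustrative sizes):

```python
import torch
import torch.nn as nn

channels = 64
reasoning_layer = nn.Identity()                          # stand-in for the reasoning layer
fuse_1x1 = nn.Conv2d(2 * channels, channels, kernel_size=1)

# Reasoner2: the reasoning output is concatenated with the FPN output (shortcut),
# and a 1x1 conv fuses reasoning and original convolutional features, so the
# original features remain reusable even if the reasoning layer attenuates them.
fpn_out = torch.randn(2, channels, 13, 13)
fused = fuse_1x1(torch.cat([reasoning_layer(fpn_out), fpn_out], dim=1))
```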

3 Experiments

The paper reports quantitative comparisons against the YOLOv3 baseline on COCO; as noted above, YOLOv3-Reasoner2 improves AP by roughly 2.5% over the baseline while a model combining convolutional and reasoning features still runs in real time.

4 Reference

[1] Analysis of Visual Reasoning on One-Stage Object Detection.
