Unified multi-task learning paradigm based on YOLOR

Introduction

Multi-task learning aims to use a single model to learn multiple tasks and improve the performance of all of them simultaneously, much like humans can learn several tasks at once and transfer knowledge gained from one task to another. In today's rapidly developing AI field, multi-modal and multi-task learning is no longer a new concept but an inevitable trend. For example, UniMatch, which the author introduced previously, is a typical example of multi-task learning:

UniMatch

Similarly, in the multi-modal field, using the Transformer architecture and cross-attention mechanisms to align different modalities such as speech, text, and images is already commonplace. Even so, using a single model to perform multiple tasks remains the core idea, because it requires the model to build connections between different tasks, just like the human cognitive system.

The article introduced today focuses on combining multiple visual tasks, including semantic segmentation, object detection, instance segmentation, and even image caption generation. Although these tasks have different purposes, they should share the same or similar semantic definitions. For example, the definition of an object such as "car" should be consistent across tasks. If the dependencies between tasks are correctly defined, the resulting system will be more powerful and easier to use.

Notably, this article does not take the Transformer route. Instead, it chooses YOLOR as the basis of the multi-task learning network, together with ELAN, the network architecture used in YOLOv7, which is designed to optimize the efficiency of gradient transfer. YOLOR leverages knowledge from explicit data observation and learned hidden information to improve the shared representation in multi-task learning while minimizing the number of training parameters. The goal of this model is to map the same semantics of different tasks to the same hidden information, so that the model can capture a unified representation for multi-task learning.

Unlike the previous YOLOR and YOLOv7 work, which only trained on two tasks at a time, this work effectively scales the approach to more tasks, including image caption generation.

YOLOR

In order to maximize the sharing of semantic information in multi-task learning, the authors carefully designed several training strategies. They also observed that different data augmentation methods have different semantic impacts on different tasks, so they propose a training process designed from a semantic perspective to reduce conflicts between tasks and enhance the robustness of training. Finally, the paper applies an asymmetric data augmentation method to reduce the negative impact of semantic inaccuracies, which the authors find very helpful for joint multi-task learning of vision-language and vision tasks.

Method

Network Architecture

This article uses the ELAN design as the main axis of the network and the concept of YOLOR as the basic architecture of the system to build the network backbone. A hard parameter sharing mechanism is then adopted for multi-task learning: lightweight heads are designed for the individual tasks, namely object detection, instance segmentation, and semantic segmentation, and they are incorporated into an image encoder that serves the image caption generation task.

The outputs of these heads can be used to capture features with sufficient semantic information to satisfy the needs of the text decoder. Processing the different visual tasks on the encoder side allows the system to obtain outputs with different semantics. In the proposed system, foreground objects rely on object detection and instance segmentation to obtain accurate information, while the correct background information is provided by semantic segmentation. As for the text decoder for the image caption generation task, the authors directly use a Transformer decoder. The image encoder and text decoder can be trained together, which keeps the model lightweight and efficient while reducing the consumption of training resources.
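To make the hard parameter sharing concrete, here is a minimal PyTorch sketch of a shared backbone feeding lightweight task heads plus a Transformer text decoder. The module names, channel sizes, and head shapes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Hard parameter sharing: one shared image encoder, several lightweight heads."""

    def __init__(self, backbone: nn.Module, feat_dim=256,
                 num_classes=80, num_stuff=133, vocab_size=30522):
        super().__init__()
        self.backbone = backbone                            # shared encoder (ELAN + YOLOR style)
        self.det_head = nn.Conv2d(feat_dim, 3 * (5 + num_classes), 1)  # detection logits per anchor
        self.ins_head = nn.Conv2d(feat_dim, 32, 1)          # YOLACT-style mask coefficients
        self.seg_head = nn.Conv2d(feat_dim, num_stuff, 1)   # per-pixel semantic logits
        dec_layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.text_decoder = nn.TransformerDecoder(dec_layer, num_layers=3)
        self.word_emb = nn.Embedding(vocab_size, feat_dim)
        self.lm_head = nn.Linear(feat_dim, vocab_size)

    def forward(self, images, captions=None):
        feats = self.backbone(images)                       # assume a single (B, C, H/8, W/8) map
        out = {
            "det": self.det_head(feats),
            "ins": self.ins_head(feats),
            "seg": self.seg_head(feats),
        }
        if captions is not None:
            memory = feats.flatten(2).transpose(1, 2)       # (B, HW, C) image tokens as memory
            tgt = self.word_emb(captions)
            tgt_mask = nn.Transformer.generate_square_subsequent_mask(
                captions.size(1)).to(captions.device)       # causal mask for autoregressive training
            out["caption_logits"] = self.lm_head(
                self.text_decoder(tgt, memory, tgt_mask=tgt_mask))
        return out
```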


Finally, the entire work combines YOLOR and ELAN, following the design of YOLOv7. As mentioned above, YOLOR is a network architecture specifically designed to perform multiple tasks. Intuitively, it mimics the way humans learn, both explicitly (from data observation and supervised training) and implicitly (by encoding previously learned common features and connecting them across different tasks). ELAN, in turn, uses gradient path analysis to design the network architecture; its most important design concept is to optimize the gradient path. This design makes the network more lightweight and lets it propagate gradient information more efficiently, thereby accelerating training and improving the representation ability of the model.
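As a rough illustration of the gradient-path idea, below is a simplified ELAN-style block: the input is split into parallel 1×1 branches, one branch passes through stacked 3×3 convolutions, and all intermediate outputs are concatenated and fused, which keeps several short gradient paths. Channel counts and the number of stacked convolutions are assumptions; the actual YOLOv7 ELAN blocks differ in detail.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )

class ELANBlock(nn.Module):
    """Simplified ELAN-style aggregation block (illustrative, not the exact YOLOv7 block)."""

    def __init__(self, c_in, c_mid, c_out):
        super().__init__()
        self.branch1 = conv_bn_act(c_in, c_mid, 1)     # shortcut-style 1x1 branch
        self.branch2 = conv_bn_act(c_in, c_mid, 1)     # entry to the computational branch
        self.block1 = nn.Sequential(conv_bn_act(c_mid, c_mid, 3), conv_bn_act(c_mid, c_mid, 3))
        self.block2 = nn.Sequential(conv_bn_act(c_mid, c_mid, 3), conv_bn_act(c_mid, c_mid, 3))
        self.fuse = conv_bn_act(4 * c_mid, c_out, 1)   # aggregate all gradient paths

    def forward(self, x):
        y1 = self.branch1(x)
        y2 = self.branch2(x)
        y3 = self.block1(y2)
        y4 = self.block2(y3)
        # concatenating intermediate features keeps multiple short gradient paths
        return self.fuse(torch.cat([y1, y2, y3, y4], dim=1))
```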

Object detection and instance segmentation heads

For the object detection task, the object detection head of YOLOv7 is used directly. For the instance segmentation task, the YOLACT model was chosen, which adds an instance segmentation head to the YOLO architecture. YOLOv7 already has a branch that integrates YOLACT for instance segmentation, so that head was chosen to perform the instance segmentation task. It performs well in real-time instance segmentation and can, in principle, be easily fused with any YOLO-based architecture. The following figure shows the architecture of the object detection head and instance segmentation head:
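For reference, a YOLACT-style head can be sketched as a small "protonet" that predicts k prototype masks plus per-instance coefficients that combine them linearly. This is a generic sketch of the YOLACT idea, not the exact YOLOv7 branch; channel counts and the number of prototypes are assumptions.

```python
import torch
import torch.nn as nn

class ProtoNet(nn.Module):
    """Predicts k prototype masks from a high-resolution neck feature map."""

    def __init__(self, in_ch=256, num_protos=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_protos, 1),
        )

    def forward(self, feats):            # feats: (B, C, H, W)
        return self.net(feats)           # (B, k, H, W) prototype masks

def assemble_masks(protos, coeffs):
    """Linearly combine prototypes with per-instance coefficients.

    protos: (B, k, H, W), coeffs: (B, N, k)  ->  (B, N, H, W) instance masks
    """
    masks = torch.einsum("bkhw,bnk->bnhw", protos, coeffs)
    return masks.sigmoid()
```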

Semantic segmentation head

For the semantic segmentation task, the authors explored the impact of single-scale versus multi-scale feature combinations. Specifically, they designed two methods to obtain semantic masks. One directly upsamples the feature map at the 8×8 resolution (the one closest to the original resolution) to the 1×1 resolution (single scale); the other combines the neck feature maps at three different resolutions (8×8, 16×16, and 32×32) and then upsamples the result to 1×1 (multi-scale).

The results obtained by the two methods are shown in the figure below:

It can be seen that the predictions of the multi-scale model in the stuff regions are noisy. The authors believe this results from the semantic gap between object detection and semantic segmentation. Object detection tries to classify all non-instance areas as background, so the sky, walls, ceiling, and ground are all easily lumped into the same category. For semantic segmentation, sensitivity to spatial relationships is relatively important, so the authors design the semantic segmentation head in a single-scale manner at the highest resolution (i.e., 8×8).

However, upsampling the feature map all the way back to the original size is inefficient, so the upsampling stops at a relatively shallow level (from 8×8 to 4×4) to obtain the semantic segmentation predictions. With this method, compared with the multi-scale variant, the single-scale head upsampled back to the original size (1×1) reduces the number of parameters by 94.3% and training time by 6.3%. Compared with directly upsampling to 1×1, upsampling only to 4×4 further reduces training time by 84.4% while keeping fewer parameters and achieving the highest accuracy.
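A minimal sketch of such a single-scale head is shown below: it consumes only the 1/8-resolution (8×8) feature map, learns the upsampling to the 4×4 level, and leaves the final resize to full resolution to cheap bilinear interpolation. Layer choices and channel counts are assumptions, not the paper's exact head.

```python
import torch.nn as nn
import torch.nn.functional as F

class SemSegHead(nn.Module):
    """Single-scale semantic segmentation head on the 1/8-resolution feature map."""

    def __init__(self, in_ch=256, num_classes=133):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, padding=1), nn.BatchNorm2d(128), nn.SiLU(inplace=True),
            nn.ConvTranspose2d(128, 128, 2, stride=2),   # learned upsampling: 8x8 -> 4x4 level
            nn.Conv2d(128, num_classes, 1),
        )

    def forward(self, p3, image_size):
        logits = self.conv(p3)                           # predictions at 1/4 of the input resolution
        # bilinear resize to the original image size only when full-size masks are needed
        return F.interpolate(logits, size=image_size, mode="bilinear", align_corners=False)
```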

Image caption generation head

For the image caption generation task, the article is inspired by the Transformer's success in NLP: the self-attention mechanism is implemented in the backbone network, which serves as the encoder for the image caption generation task. A Transformer text decoder is then used as the head for image caption generation:

Table 2 shows that they compared the CATR model, which uses a full-scale Transformer as the text decoder, against a variant using only the Transformer decoder. The overall number of parameters is reduced by 7.5% and training time by 25%, while the BLEU-4 score is slightly improved. The article hypothesizes that the poorer performance of the full-scale Transformer is due to its conflict with the backbone network: the backbone already performs self-attention and therefore overlaps with the function of the Transformer encoder. Based on these results, they use a simplified Transformer text decoder and again use ELAN+YOLOR as the backbone network.
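Assuming an encoder and decoder organized like the earlier MultiTaskModel sketch (with `backbone`, `word_emb`, `text_decoder`, and `lm_head` attributes), a simple greedy decoding loop for such a decoder-only captioning head could look like this; the BOS/EOS token ids and maximum length are illustrative assumptions.

```python
import torch

@torch.no_grad()
def greedy_caption(model, image, bos_id=101, eos_id=102, max_len=30):
    """Autoregressively generate a caption for a single (1, C, H, W) image tensor."""
    feats = model.backbone(image)                     # shared image encoder output
    memory = feats.flatten(2).transpose(1, 2)         # (1, HW, C) image tokens as decoder memory
    tokens = torch.tensor([[bos_id]], device=image.device)
    for _ in range(max_len):
        tgt = model.word_emb(tokens)
        logits = model.lm_head(model.text_decoder(tgt, memory))
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)   # greedy pick for the last position
        tokens = torch.cat([tokens, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return tokens.squeeze(0).tolist()
```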

Training strategy

Data augmentation

Data augmentation is a method used to improve the robustness of the model. However, not all data augmentation methods are beneficial: some of them make the semantics of the data space and the label space inconsistent. To solve this problem, the paper designs a simple and intuitive training process for learning the different tasks.

First, the authors organize several data augmentation pipelines according to the nature of each task. For the object detection, instance segmentation, and semantic segmentation tasks, they chose methods such as MixUp, Cutout, Mosaic, Copy-Paste, and Random Perspective. For the image caption generation task, they only used resize and padding.

Furthermore, they apply the different data augmentation pipelines to the same image simultaneously, ensuring that semantic correctness is maintained for every task during learning. This strategy accounts for the robustness of the model on vision tasks while maintaining the consistency of data and semantics across tasks.
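The sketch below illustrates the asymmetric idea with torchvision stand-ins: stronger geometric augmentation feeds the pure vision branches, while the captioning branch only sees resizing and cropping so the caption text stays valid. The real pipeline would use Mosaic/MixUp/Copy-Paste and would also remap boxes and masks, which is omitted here.

```python
import torchvision.transforms as T

# Stronger augmentation for detection / instance seg / semantic seg (the real pipeline
# must also remap boxes and masks accordingly; that bookkeeping is omitted here).
vision_task_aug = T.Compose([
    T.RandomPerspective(distortion_scale=0.3, p=0.5),   # geometric distortion
    T.RandomErasing(p=0.3),                             # Cutout-style occlusion
])

# Captioning branch: keep the image content intact so the caption still matches it.
caption_aug = T.Compose([
    T.Resize(640),       # shorter side to 640; a paper-style pipeline would letterbox-pad instead
    T.CenterCrop(640),   # stand-in for padding to a fixed square input
])

def make_multitask_sample(image_tensor):
    """Apply task-appropriate augmentation branches to the same CHW image tensor."""
    return {
        "vision": vision_task_aug(image_tensor),
        "caption": caption_aug(image_tensor),
    }
```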

Optimizer strategy

In image captioning tasks, it is common to pretrain an image encoder and then fine-tune the encoder while training a new text decoder. Therefore, it is common to give the image encoder a smaller learning rate to avoid disturbing what the image encoder has already learned. However, in this study, the authors argue that since multiple tasks are trained simultaneously, the image encoder needs to be able to adapt to different data types and outputs, thus requiring a larger learning rate.

To verify this, they conducted a series of experiments and found that the best results occurred when the learning rates of the image encoder and text decoder were equal. This lets the image encoder keep adapting to different inputs and outputs while continuing to learn, yielding better results. Overall, these training strategies help improve model performance while maintaining consistency and robustness in multi-task learning, and their design is very important to successful multi-task learning.
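In code, this simply means putting the image encoder and the text decoder in optimizer parameter groups that share one learning rate, instead of shrinking the encoder's rate as fine-tuning recipes usually do. The optimizer choice, the 1e-3 value, and the attribute names (taken from the earlier sketch) are assumptions.

```python
import torch

def build_optimizer(model, lr=1e-3, equal_lr=True):
    """Share one learning rate across encoder and decoder (equal_lr=True), or use the
    common fine-tuning heuristic of a smaller encoder rate for comparison."""
    encoder_params = list(model.backbone.parameters())
    decoder_params = list(model.text_decoder.parameters())
    other_params = [p for n, p in model.named_parameters()
                    if not n.startswith(("backbone", "text_decoder"))]
    encoder_lr = lr if equal_lr else lr * 0.1   # the usual heuristic would shrink the encoder rate
    return torch.optim.AdamW([
        {"params": encoder_params, "lr": encoder_lr},
        {"params": decoder_params, "lr": lr},
        {"params": other_params, "lr": lr},
    ])
```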

Experiments


Summary

This paper analyzes the semantic information required for the image caption generation task from the perspective of human learning. The researchers analyzed the correlations between different visual tasks and combined multiple tasks for training, maximizing the semantics shared by all tasks. Furthermore, they discuss data augmentation techniques and optimizer settings in depth, designing the training process from a semantic perspective to reduce the impact of semantic errors. Experimental results show that, compared with other multi-task models, the proposed model is more lightweight and achieves excellent results on all tasks. In addition, under the joint multi-task learning architecture, by sharing semantics and the learning rate, the image caption generation task achieves good performance without using any pre-trained model, and the approach scales well.

Discussion

An interesting question follows from this: the impact of the particular network architecture may not be that significant. This work shows that "unification" does not require a Transformer; a "traditional" CNN can do it well too. With proper design, either approach works, and sometimes the cost of trial and error is much lower.

Of course, differences between different network architectures do exist, but the effectiveness of the task depends on the combined influence of multiple factors. Choosing the appropriate network architecture, data, training strategy, and task relevance are key factors to achieve successful multi-task learning. The nature and requirements of different tasks also influence which network architecture is chosen and how it is trained. Therefore, when solving specific tasks, these factors still need to be considered comprehensively to obtain the best performance.

Speaking of which, the author still hopes that OpenAI will release the GPT-4V technical report as soon as possible, rather than Microsoft's lukewarm "evaluation report"~~~ Looking carefully at that evaluation, you can see that, in theory, GPT-4V should not simply be a "Transformer" that aligns multi-modal information for image understanding (nothing new there), nor does it seem to use MoE (the accuracy is not very convincing). Judging from OpenAI's past work, it is most likely an end-to-end multi-modal, multi-task learning paradigm, perhaps something CLIP-like: very simple yet effective, low-key yet elegant. The key question is how to make it work and still fit together so nicely.


Origin blog.csdn.net/CVHub/article/details/134224749