YOLO series target detection algorithm - YOLOR

YOLO series target detection algorithm catalog - article link


This article summarizes:

  1. The problem is analyzed from a unique perspective: humans can understand things through normal learning (called explicit knowledge) or subconsciously (called implicit knowledge), and they can analyze a problem from multiple angles, so the paper considers letting models also encode explicit and implicit knowledge together, just like people;
  2. A unified network that can complete various tasks is proposed; it learns a general representation by integrating explicit and implicit knowledge, and various tasks can be completed through this general representation;
  3. Kernel space alignment, prediction refinement, and multi-task learning are introduced into the implicit knowledge learning process;
  4. Methods for modeling implicit knowledge with vectors, neural networks, or matrix factorization are discussed respectively, and their effectiveness is verified;
  5. It is confirmed that the learned implicit representation accurately corresponds to specific physical features and can also be visualized;
  6. It is demonstrated that if an operator fits the physical meaning of a target, it can be used to integrate explicit and implicit knowledge, producing a multiplier effect;
  7. Combined with state-of-the-art methods, the unified network proposed in the paper achieves accuracy comparable to Scaled-YOLOv4-P7 on object detection, while the inference speed is increased by 88%.

Summary of deep learning knowledge points

Column link:
https://blog.csdn.net/qq_39707285/article/details/124005405

This column mainly summarizes knowledge points in deep learning. It starts from the major dataset competitions and introduces the champion algorithms of past years; it also summarizes important deep learning topics, including loss functions, optimizers, various classic algorithms, and optimization strategies such as Bag of Freebies (BoF).



YOLO series target detection algorithm - YOLOR

2021.5.10 YOLOR: "You Only Learn One Representation: Unified Network for Multiple Tasks"

1 Introduction

  People "understand" the world through sight, hearing, touch and past experiences. Human experience can be learned through normal learning (called explicit knowledge), or subconsciously (called tacit knowledge). 【Tacit knowledge exists in the human brain. It is the knowledge closely related to personal experience accumulated by people in long-term practice. It is often some skills, which are not easy to express in words, and are not easy to be learned by others. Explicit knowledge, people can Acquired through oral teaching, textbooks, reference materials, periodicals, patent documents, audio-visual media, software, and databases, and can also be transmitted through language, books, text, databases, etc., and can be easily learned by people.] These are through normal Learning or subconsciously learned experiences will be encoded and stored in the brain, using these rich experiences as a huge database, humans can efficiently process data even if they are not seen beforehand.

  In this paper, a unified network is proposed to encode explicit and implicit knowledge together, just like the human brain can learn knowledge from normal learning and subconscious learning. A unified network can generate a unified representation while serving various tasks. Kernel spatial alignment, prediction refinement, and multi-task learning can be performed in convolutional neural networks.

  The results show that when implicit knowledge is introduced into neural networks, it benefits the performance of all tasks. The latent representations learned from the proposed unified network are then further analyzed, demonstrating the ability to capture the physical meaning of different tasks.

1.1 Problem Analysis

[Figure 1: humans can analyze the same piece of data from different angles]

  As shown in the figure above, humans can analyze the same piece of data from different angles. However, a trained convolutional neural network (CNN) model can usually only satisfy one objective.

  In general, features extracted from a trained CNN are often difficult to adapt to other types of problems. The main reason for this is that we only extract features from the neurons and do not use the rich implicit knowledge in the CNN. When a real human brain is at work, that implicit knowledge can effectively help the brain perform various tasks.

  Implicit knowledge refers to knowledge learned in the subconscious state, but there is no systematic definition of how implicit learning works or how implicit knowledge is acquired. In the common usage for neural networks, features obtained from shallow layers are often called explicit knowledge, while features obtained from deep layers are called implicit knowledge. In this paper, knowledge that directly corresponds to the observation is called explicit knowledge, and knowledge that is implicit in the model and unrelated to the observation is called implicit knowledge.

  This paper proposes a unified network to integrate implicit and explicit knowledge so that the learned model contains a general representation, and this general representation yields sub-representations suitable for various tasks. Figure 2(c) shows the proposed unified network architecture, and the method for building the unified network is to combine compressed sensing and deep learning.
[Figure 2(c): the proposed unified network architecture]

2. Relevant knowledge

  To complete the design of the algorithm in this paper, some knowledge is required, which is mainly divided into three aspects:

  • Explicit deep learning: covers some methods that can automatically adjust or select features based on the input data;
  • Implicit deep learning: covers implicit deep knowledge learning and implicit differential derivatives;
  • Knowledge modeling: lists several approaches that can be used to integrate explicit and implicit knowledge.

2.1 Explicit Deep Learning

  Explicit deep learning can be done in the following ways:

  • The Transformer is one way; it mainly uses queries, keys, and values to obtain self-attention;
  • Non-local networks are another way to obtain attention, mainly extracting pairwise attention over time and space;
  • Another common explicit deep learning method is to automatically select the appropriate kernel based on the input data.

2.2 Implicit Deep Learning

  The methods in the category of implicit deep learning are mainly implicit neural representations and deep equilibrium models. The former mainly obtains a parameterized continuous mapping representation of discrete inputs in order to perform different tasks, while the latter converts implicit learning into a residual-form neural network and computes its equilibrium point.

2.3 Knowledge Modeling

  The methods in the category of knowledge modeling mainly include sparse representation and memory networks. The former performs modeling using exemplars, predefined complete dictionaries, or learned dictionaries, while the latter relies on combining various embedding modalities to form a memory, and enables the memory to be added to or changed dynamically.

3. How implicit knowledge works

  The main purpose of this paper is to build a unified network that can effectively train implicit knowledge, so we first focus on how to train and infer implicit knowledge quickly. Since an implicit representation $z_i$ is independent of the observation, it can be regarded as a set of constant tensors $Z = \{z_1, z_2, \dots, z_k\}$. The following describes how implicit knowledge, as constant tensors, can be applied to various tasks.

3.1 Manifold space reduction

[Figure 3: manifold space reduction]

  A good representation should be able to find an appropriate projection in the manifold space it belongs to, so that the subsequent target task can succeed. For example, as shown in Figure 3, the best result is obtained if the target classes can be successfully separated by a hyperplane in the projected space. In the above example, the inner product of the projected vector and the implicit representation can be used to reduce the dimensionality of the manifold space and effectively achieve the goals of various tasks.
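  As an illustration of this manifold space reduction idea, here is a minimal PyTorch sketch (our own, not the paper's code) in which a learned constant vector z plays the role of the implicit representation and an inner product with the projected features collapses each feature vector to a one-dimensional score suited to a downstream task; the module name, shapes, and dimensions are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class ManifoldReduction(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        # Implicit representation z: a constant (learned) vector, independent of the input.
        self.z = nn.Parameter(torch.randn(feat_dim))

    def forward(self, projected_feats: torch.Tensor) -> torch.Tensor:
        # projected_feats: (batch, num_points, feat_dim)
        # The inner product with z collapses each feature vector to a scalar score,
        # i.e. a 1-D projection of the manifold for the downstream task.
        return projected_feats @ self.z  # (batch, num_points)

# Usage
feats = torch.randn(2, 5, 16)
scores = ManifoldReduction(16)(feats)
print(scores.shape)  # torch.Size([2, 5])
```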

3.2 Kernel space alignment

[Figure 4: kernel space misalignment (a) and alignment via addition and multiplication with implicit representations (b)]

  Kernel space misalignment is a common problem in multi-task and multi-head neural networks; Figure 4(a) shows an example of such misalignment. To solve this problem, the output features and the implicit representations can be added and multiplied so that the kernel space is translated, rotated, and scaled to align each output kernel space of the neural network, as shown in Figure 4(b). This mode of operation can be widely used in different settings, such as aligning the features of large and small objects in a Feature Pyramid Network (FPN), using knowledge distillation to integrate large and small models, and handling zero-shot domain transfer, among other issues.
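  The following is a hedged PyTorch sketch of that add-and-multiply alignment applied to FPN outputs; the module name ImplicitAlign, the per-level channel counts, and the broadcasting shapes are our own assumptions rather than YOLOR's actual implementation.

```python
import torch
import torch.nn as nn

class ImplicitAlign(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # One constant tensor per operation, broadcast over H and W.
        # Initialized near 0 (addition) and near 1 (multiplication), matching the
        # training section below, so the module is close to an identity at first.
        self.shift = nn.Parameter(torch.zeros(1, channels, 1, 1))  # addition -> translate
        self.scale = nn.Parameter(torch.ones(1, channels, 1, 1))   # multiplication -> scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.scale + self.shift

# One alignment module per FPN level (channel counts are illustrative).
aligners = nn.ModuleList([ImplicitAlign(c) for c in (128, 256, 512)])
fpn_feats = [torch.randn(1, c, s, s) for c, s in ((128, 80), (256, 40), (512, 20))]
aligned = [m(f) for m, f in zip(aligners, fpn_feats)]
```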

3.3 More functions

[Figure 5: more functions of implicit knowledge]
  In addition to the functions that can be applied to different tasks, implicit knowledge can be extended to many more functions. As shown in Figure 5, by introducing addition, a neural network can be made to predict offsets of the center coordinates. Multiplication can be introduced to automatically search the hyperparameter set of the anchors, which anchor-based object detectors often require. In addition, dot multiplication and concatenation can be used, respectively, to perform multi-task feature selection and to set preconditions for subsequent calculations.
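  A minimal sketch of the two refinement uses mentioned above, with addition nudging predicted box centers and multiplication rescaling anchor width/height; the class name, tensor shapes, and anchor count are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class PredictionRefine(nn.Module):
    def __init__(self, num_anchors: int):
        super().__init__()
        self.center_offset = nn.Parameter(torch.zeros(num_anchors, 2))  # added to (x, y)
        self.anchor_scale = nn.Parameter(torch.ones(num_anchors, 2))    # multiplies (w, h)

    def forward(self, xy: torch.Tensor, wh: torch.Tensor):
        # xy, wh: (batch, num_anchors, 2)
        return xy + self.center_offset, wh * self.anchor_scale

xy = torch.rand(4, 3, 2)
wh = torch.rand(4, 3, 2)
xy_refined, wh_refined = PredictionRefine(num_anchors=3)(xy, wh)
```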

4. Implicit knowledge

  In this section, we compare the objective functions of traditional networks and the proposed unified network, and explain why introducing implicit knowledge is important for training multipurpose networks. Meanwhile, the details of the method proposed in this paper will also be elaborated.

4.1 Formula representation

  • In Convolutional Neural Networks

  The objective function for training a traditional network can be expressed as (1):
$$y = f_\theta(x) + \epsilon, \quad \text{minimize } \epsilon \tag{1}$$
  where x is the observed value, θ is the parameter set of the neural network, $f_\theta$ represents the operation of the neural network, $\epsilon$ is the error term, and y is the target for the given task.
  In the training process of a traditional neural network, $\epsilon$ is usually minimized so that $f_\theta(x)$ is as close to the target as possible. This means that we expect different observations with the same target to map to a single point in the subspace obtained by $f_\theta$, as shown in Fig. 6(a). In other words, the solution space we expect to obtain is discriminative only for the current task $t_i$ and invariant to the other potential tasks $T \setminus t_i$, where $T = \{t_1, t_2, \dots, t_n\}$.
[Figure 6: solution spaces obtained with (a) a single-task objective, (b) a relaxed error term, (c) an error term modeled with explicit and implicit knowledge]

  For a general-purpose neural network, we hope that the resulting representation can serve all tasks in $T$. Therefore, we need to relax $\epsilon$ so that solutions for each task can be found simultaneously on the manifold space, as shown in Fig. 6(b). However, this requirement makes it impossible to use a simple mathematical method, such as the maximum of a one-hot vector or a threshold on Euclidean distance, to obtain the solution for $t_i$. To solve this problem, we must model the error term $\epsilon$ in order to find solutions for the different tasks, as shown in Fig. 6(c).

  • unified network

  To train the proposed unified network, we use both explicit and implicit knowledge to model the error term and then use it to guide the multipurpose network training process. The corresponding training formula is as follows:
$$y = f_\theta(x) + \epsilon + g_\phi\big(\epsilon_{ex}(x), \epsilon_{im}(z)\big), \quad \text{minimize } \epsilon + g_\phi\big(\epsilon_{ex}(x), \epsilon_{im}(z)\big) \tag{2}$$

  where $\epsilon_{ex}$ and $\epsilon_{im}$ are operations that model the explicit and implicit errors of the observation x and the latent code z, respectively. $g_\phi$ is a task-specific operation used to combine or select information from explicit and implicit knowledge.

  There are some existing methods to integrate explicit knowledge into $f_\theta$, so (2) can be rewritten as (3):
$$y = f_\theta(x) \star g_\phi(z) \tag{3}$$

   where $\star$ denotes a possible operator for combining $f_\theta$ and $g_\phi$; in this article, addition, multiplication, and concatenation are used.

If the derivation process of the error term is extended to handle multiple tasks, the following formula can be obtained:
$$F(x, \theta, Z, \Phi, Y, \Psi) = 0 \tag{4}$$

  where $Z = \{z_1, z_2, \dots, z_T\}$ is the set of implicit latent codes for the T different tasks, $\Phi$ are the parameters used to generate the implicit representations from Z, and $\Psi$ is used to compute the final output from the different combinations of explicit and implicit representations.
  For different tasks, predictions for all z ∈ Z can be obtained using the following formula:
$$d_\Psi\big(f_\theta(x), g_\Phi(z), y\big) = 0, \quad \forall z \in Z \tag{5}$$
  For all tasks, we start from a common unified representation $f_\theta(x)$, go through the task-specific implicit representation $g_\Phi(z)$, and finally complete each task with the task-specific discriminator $d_\Psi$.

4.2 Modeling implicit knowledge

  The implicit knowledge proposed in this paper can be modeled in the following ways:

  • Vector / Matrix / Tensor: $z$
    The vector z is used directly as a prior of the implicit knowledge and directly as the implicit representation. In this case, each dimension must be assumed to be independent of the others.
  • Neural network: $W z$
    The vector z is used as a prior of the implicit knowledge, and a weight matrix W is then applied to perform a linear combination or nonlinear transformation, which becomes the implicit representation. In this case, the dimensions are assumed to be interdependent. More complex neural networks can also be used to generate the implicit representation, or Markov chains can be used to model the dependency of the implicit representations across different tasks.
  • Matrix factorization: $Z^T c$
    Multiple vectors are used as priors of the implicit knowledge; the implicit prior basis Z and a coefficient vector c together form the implicit representation. A sparsity constraint can further be imposed on c to obtain a sparse representation, and non-negativity constraints can be imposed on Z and c to convert the formulation into non-negative matrix factorization (NMF). A small code sketch of these three choices follows below.
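  Below is a small PyTorch sketch, under our own naming and dimensions, of the three modeling choices: a raw vector, a prior vector passed through a weight matrix, and a basis-times-coefficients factorization.

```python
import torch
import torch.nn as nn

dim, prior_dim, num_basis = 64, 16, 8

# 1) Vector: z itself is the implicit representation.
z_vec = nn.Parameter(torch.randn(dim))

# 2) Neural network: implicit representation = W z (dimensions may interact).
z_prior = nn.Parameter(torch.randn(prior_dim))
W = nn.Linear(prior_dim, dim, bias=False)
z_nn = W(z_prior)

# 3) Matrix factorization: implicit representation = Z^T c, a weighted
#    combination of num_basis implicit prior vectors.
Z_basis = nn.Parameter(torch.randn(num_basis, dim))
c = nn.Parameter(torch.randn(num_basis))
z_mf = Z_basis.t() @ c  # (dim,)
```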

4.3 Training

  Assume that the model starts without any prior implicit knowledge, that is, the implicit terms initially have no effect on the explicit representation $f_\theta(x)$. When the combining operator is $\star \in \{$addition, concatenation$\}$, the implicit prior is initialized as $z \sim N(0, \sigma)$; when the combining operator $\star$ is multiplication, it is initialized as $z \sim N(1, \sigma)$. Here, σ is a very small value close to zero. Both z and φ are trained with the backpropagation algorithm during training.
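  A minimal sketch of this initialization rule, assuming PyTorch; the helper name init_implicit, the shapes, and the illustrative sigma value are ours.

```python
import torch
import torch.nn as nn

def init_implicit(shape, operator: str, sigma: float = 0.02) -> nn.Parameter:
    if operator in ("addition", "concatenation"):
        mean = 0.0   # z ~ N(0, sigma): adding ~0 barely changes f_theta(x) at first
    elif operator == "multiplication":
        mean = 1.0   # z ~ N(1, sigma): multiplying by ~1 barely changes f_theta(x) at first
    else:
        raise ValueError(f"unknown operator: {operator}")
    return nn.Parameter(torch.randn(shape) * sigma + mean)

z_add = init_implicit((1, 256, 1, 1), "addition")
z_mul = init_implicit((1, 256, 1, 1), "multiplication")
```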

4.4 Inference

  Since the implicit knowledge is independent of the observation x, no matter how complex the implicit model $g_\phi$ is, it can be reduced to a set of constant tensors before the inference stage. In other words, forming the implicit information has almost no effect on the computational complexity of the algorithm. Furthermore, when the operator above is multiplication, if the subsequent layer is a convolutional layer, then equation (9) below is used for integration. When the operator is addition, if the preceding layer is a convolutional layer with no activation function, then the integration can be done using equation (10) below.
$$x_{l+1} = \sigma\big(W_l\,(g_\phi(z)\, x_l) + b_l\big) = \sigma\big(W_l'\, x_l + b_l\big), \quad W_l' = W_l\, g_\phi(z) \tag{9}$$
$$x_{l+1} = W_l\, x_l + b_l + g_\phi(z) = W_l\, x_l + b_l', \quad b_l' = b_l + g_\phi(z) \tag{10}$$
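  The folding in (9) and (10) can be sketched as follows, assuming PyTorch convolutions; the helper names and the per-channel shapes of $g_\phi(z)$ are our assumptions, and this is an illustration of the idea rather than the official YOLOR implementation.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_multiplication(conv: nn.Conv2d, g_z: torch.Tensor) -> None:
    # y = conv(g_z * x)  ==  conv'(x), with the weights scaled per input channel.
    conv.weight.mul_(g_z.view(1, -1, 1, 1))   # g_z: (in_channels,)

@torch.no_grad()
def fold_addition(conv: nn.Conv2d, g_z: torch.Tensor) -> None:
    # y = conv(x) + g_z  ==  conv'(x), with the bias shifted per output channel.
    if conv.bias is None:
        conv.bias = nn.Parameter(torch.zeros(conv.out_channels))
    conv.bias.add_(g_z.view(-1))               # g_z: (out_channels,)

conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
fold_multiplication(conv, torch.full((64,), 1.01))
fold_addition(conv, torch.full((128,), 0.05))
```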

5. Model and experiments

5.1 Model composition

[Figure 8: positions in the YOLOv4-CSP baseline where implicit knowledge is introduced]

  We choose to apply implicit knowledge to three aspects: 1) FPN feature alignment, 2) prediction refinement, and 3) multi-task learning in a single model. The tasks covered by multi-task learning include 1) object detection, 2) multi-label image classification, and 3) feature embedding. This paper chooses YOLOv4-CSP as the baseline model and introduces implicit knowledge into the model at the positions indicated by the arrows in Figure 8. All training hyperparameters follow the default settings of Scaled-YOLOv4.
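  As a rough illustration of where such implicit modules could sit relative to a YOLO output layer, here is a hedged PyTorch sketch; the class names ImplicitAdd, ImplicitMul, and DetectHead, and the channel counts, are our own and only mirror the wiring described above, not the official code.

```python
import torch
import torch.nn as nn

class ImplicitAdd(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.z = nn.Parameter(0.02 * torch.randn(1, channels, 1, 1))       # ~ N(0, sigma)
    def forward(self, x):
        return x + self.z

class ImplicitMul(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.z = nn.Parameter(1.0 + 0.02 * torch.randn(1, channels, 1, 1))  # ~ N(1, sigma)
    def forward(self, x):
        return x * self.z

class DetectHead(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.align = ImplicitAdd(in_ch)           # 1) FPN feature alignment
        self.conv = nn.Conv2d(in_ch, out_ch, 1)   # raw YOLO output layer
        self.refine = ImplicitMul(out_ch)         # 2) prediction refinement
    def forward(self, x):
        return self.refine(self.conv(self.align(x)))

head = DetectHead(256, 255)
out = head(torch.randn(1, 256, 40, 40))
```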

5.2 FPN feature alignment

  An implicit representation is added to the feature map of each FPN level for feature alignment, and the corresponding experimental results are shown in Table 1. From these results, it can be seen that after using the implicit representation for feature-space alignment, all of the metrics, including AP_S, AP_M, and AP_L, are improved by about 0.5%.
[Table 1: effect of adding implicit representations for FPN feature alignment]

5.3 Target Detection Prediction Refinement

  The implicit representation is added to the YOLO output layers for prediction refinement. As shown in Table 2, almost all of the indicator scores are improved.
[Table 2: effect of adding implicit representations for prediction refinement]
  Figure 9 shows how introducing the implicit representation affects the detection results. In the case of object detection, even if no prior knowledge is provided for the implicit representation, the proposed learning mechanism can still automatically learn patterns for (x, y), (w, h), (obj), and (classes).
[Figure 9: learned implicit representations corresponding to (x, y), (w, h), (obj), and (classes)]

5.4 General representation for multi-task learning

6 Conclusion

  In this paper, we show how to construct a unified network that integrates explicit and implicit knowledge, and prove that its multi-task learning under a single model architecture is still very effective.
