Research on Deep Learning Network Compression Methods for Missile-Borne Images

Source: Aviation Weapons

Authors: Gao Yibo, Yang Chuandong, Chen Dong, Ling Chong

Summary

Target recognition algorithms based on deep learning have complex network structures, large parameter counts, and high computational latency, which makes them difficult to apply directly to missile-borne missions. To address this problem, this paper surveys network lightweighting methods: the advantages and characteristics of existing compression methods and lightweight networks are introduced, and leading algorithms of each type are selected for comparison. Finally, combining the development of deep learning in target detection, prospects for lightweight missile-borne image target recognition algorithms are discussed.

Key words

Network model compression; Lightweight network; Missile-borne image; Deep learning model; Algorithm porting

Introduction

Precision-guided weapons are a key factor in winning modern warfare, and the core component of a precision-guided weapon is the seeker [1]. During guidance, the seeker observes the target, senses the environment, and identifies and tracks the target until the precision strike is completed. The effectiveness of seeker guidance depends on the accuracy with which the target position is judged. Semi-active laser guidance requires forward observers to designate the target with a laser, which offers extremely poor concealment; passive radar guidance is easily interfered with by the many signals present in space and easily detected by the enemy. Image guidance uses a CMOS (Complementary Metal Oxide Semiconductor) sensor to collect visible-light information about the target, has strong resistance to electromagnetic interference, and does not require forward observers [2]. Unlike the ASIC (Application Specific Integrated Circuit) chips used in ordinary mobile terminals, the core processing architectures of the missile-borne computer in the seeker mainly include DSP (Digital Signal Processor), DSP+FPGA (Field Programmable Gate Array), and SoC (System on Chip). A DSP alone has too few resources to meet current algorithm requirements, so the current mainstream approach is DSP+FPGA. FPGAs offer rich interfaces and high flexibility, but their power consumption is high, which is hard to sustain for power-hungry deep learning algorithms. To address this, major manufacturers have proposed SoCs that improve system performance while reducing cost, power consumption, weight, and size; however, their development cycles are long and development costs are high, making them unsuitable for small weapon systems that require rapid fielding. An SoC combines the resource flexibility of the PL (programmable logic) side with the powerful processor of the PS (processing system) side, which makes it suitable for deploying deep learning algorithms, but the huge parameter counts and network structures of such algorithms remain one of the main obstacles to practical use. Model compression and lightweighting are therefore key design steps: the algorithm must retain sufficient accuracy at the software level while keeping the parameter count small and the speed high when ported to hardware. Limited internal space, short projectile flight time, and the high processing-speed requirements of embedded hardware platforms are the key factors that restrict the deployment of deep learning algorithms on missile-borne platforms [3]. It is therefore necessary to study methods for compressing deep learning networks.

1 Tracking algorithm for missile-borne targets

1.1    Characteristics of missile-borne targets

The imaging features of the image-guided missile include the following aspects:

(1) Large scale changes. The projectile flies very fast, with an average speed of 200-300 m/s, so the scale of the target image in the field of view varies greatly, as shown in Figure 1. When the missile-to-target distance is large, the target image in the field of view is small and its details are unclear; when the distance is small, the target image is large and details emerge continuously. The missile-borne image tracking algorithm therefore needs scale invariance.

Figure 1 Schematic diagram of the scale change of missile-borne targets

(2) Target rotation. Owing to the shape and aerodynamic characteristics of the projectile, an image-guided missile rotates continuously in flight. Although the image acquired by the camera has been derotated, the control of the projectile's steering gear and the limited derotation precision mean that some rotation remains, so the position of the target image in the field of view changes constantly, as shown in Figure 2. The continuous rotation of the projectile causes continuous rotation of the target image, and the target appears at different positions at different times. The missile-borne image tracking algorithm therefore needs rotation invariance.

Figure 2 Schematic diagram of the rotation change of the missile-borne target

(3) The target enters and exits the field of view. An image-guided missile exhibits some nutation during high-speed flight, which causes the target to enter and leave the field of view frequently. When the target leaves the field of view, consecutive frames still share part of the same scene, so the foreground/background feature extraction methods of deep learning, mask activation, or feature association can be used to predict the target's range in the next frame. Targets that are half or partially out of the field of view can be identified and tracked relatively easily by deep-learning-based methods.

1.2    Comparison of various tracking algorithms

Given the unique characteristics of missile-borne targets described above, the available search and tracking algorithms are compared in Table 1.

Table 1 Typical tracking methods and comparison of their advantages and disadvantages

As Table 1 shows, the background subtraction method is only suitable for recognizing targets against a fixed background and is easily affected by ambient light, so it is not suitable for missile-borne images whose brightness changes constantly. The frame difference method distinguishes the target region through differences between consecutive frames, but because of the projectile's high-speed motion, the difference between consecutive missile-borne images is too large to identify an effective target region. The optical flow method has poor noise resistance. Simple template matching has poor real-time performance and poor robustness: it lacks template updating, and since the scale of missile-borne images changes constantly, its accuracy is low. The SIFT algorithm handles rotation, scale changes, and brightness changes well, but it is unstable under viewpoint changes, affine changes, and noise, and it is not sufficiently real-time. Deep-learning-based algorithms are trained on a wide variety of targets; a trained network achieves high accuracy and adapts well to scale changes, rotation, and even targets leaving the field of view. Because deep learning networks are large, there is a trade-off between speed and accuracy: the common problem is that the network structure, parameter count, and computation are all too large, imposing a heavy load on the missile-borne platform. It is therefore necessary to find ways to compress deep learning networks so that they fit within what the missile-borne platform can tolerate.

2 Model compression methods

This section focuses on current mainstream compression methods and lightweight network designs; the specific techniques are listed in Table 2.

Table 2 Various model compression and acceleration methods

2.1    Parameter quantization and sharing

Most operating systems and programming languages default to single-precision, 32-bit floating-point numbers, which consume a large amount of memory: in a deep learning network model, each of the vast number of weights and activation values occupies 4 bytes. In the missile-borne environment, it is acceptable to reduce the precision of parameters as long as the result is approximately preserved. Techniques such as quantization, hashing, and changing the form of computation are commonly used to reduce the parameter volume and remove redundancy.

 2.1.1 Quantization parameters

Common quantization schemes include 1-bit binary networks, 2-bit ternary networks, and reducing 32-bit floating-point numbers to 16-bit floating-point or to 16-bit and 8-bit fixed-point numbers.

Rastegari et al. [4] approximated the filters of a binary weight network with binary values, saving 32× memory. Guo et al. [5] proposed a multi-category image segmentation model combining a binary weight network structure with a depth-first search algorithm, applied it to face recognition, and achieved good results. A binary-weight version of AlexNet is as accurate as the full-precision version.

Li et al. [6] introduced the ternary weight network (TWN), in which weights are restricted to +1, 0, and -1. Full-precision weights are converted to ternary weights by minimizing Euclidean distance, and the threshold-based ternarization function is optimized to obtain a fast, simple approximate solution. Experiments show that the algorithm maintains high prediction performance and compression efficiency, and that TWNs are more expressive and effective than recently proposed networks such as their binary-precision counterparts. TWNs achieve model compression rates of 16× or 32× and also compress well in higher dimensions, making them a promising deep-learning-based image classification approach. Benchmarks on MNIST, CIFAR-10, and the large ImageNet dataset show that TWNs fall only slightly below full precision but perform much better than comparable binary-precision networks.

Krishnamoorthi [7] proposed a quantization method that quantizes both weights and input values to uint8, with the biases in the activation path quantized to uint32. The method computes the range distribution of the weights in each layer, maps each quantized weight accordingly, and also provides quantization-aware training to preserve end-to-end accuracy.
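To make the idea concrete, the following is a minimal sketch of post-training affine quantization of a weight tensor to uint8. It illustrates the general scale/zero-point scheme rather than the exact method of [7]; the tensor shapes and names are hypothetical.

```python
import numpy as np

def quantize_uint8(w: np.ndarray):
    """Affine (asymmetric) quantization of a float32 tensor to uint8.

    Maps [w.min(), w.max()] linearly onto [0, 255]; returns the quantized
    tensor plus the (scale, zero_point) needed to dequantize.
    """
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0 or 1.0   # guard against constant tensors
    zero_point = int(round(-w_min / scale))
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Hypothetical layer: each value shrinks from 4 bytes (float32) to 1 byte.
w = np.random.randn(64, 3, 3, 3).astype(np.float32)
q, s, z = quantize_uint8(w)
print("max reconstruction error:", np.abs(dequantize(q, s, z) - w).max())
```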

 2.1.2 Weight substitution

Han et al. [8] used the k-means clustering algorithm to group the weights, fine-tuning continuously to reduce the loss of precision: the adjusted weights are clustered, each cluster center replaces all the other weights in its cluster, and Huffman coding is finally applied for further compression. Chen et al. [9] built HashedNets using a hash function, quantizing and grouping the network weights before training and sharing the weights within each group; the structure is shown in Figure 3. In this method, only a hash index and a small number of weights are stored, which greatly reduces storage. Because a new approach is used for the similarity measurement of the dataset, the computation required to match query results is also reduced. Experiments show that the method achieves good results, but it does not save runtime memory or inference time, because hash maps are needed to retrieve the shared weights.

Figure 3 Structure of HashedNets
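A minimal sketch of the clustering stage of [8] follows, assuming scikit-learn is available; each weight is replaced by its cluster centroid, so with 16 clusters each weight can in principle be stored as a 4-bit index into a small codebook. The fine-tuning and Huffman-coding stages of [8] are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_share_weights(w: np.ndarray, n_clusters: int = 16):
    """Cluster weights with k-means and replace each weight by its centroid,
    as in the weight-sharing stage of [8]."""
    flat = w.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(flat)
    codebook = km.cluster_centers_.ravel()        # the shared weight values
    indices = km.labels_.astype(np.uint8)         # 4-bit codes in principle
    shared = codebook[indices].reshape(w.shape)   # reconstructed layer
    return shared, codebook, indices

w = np.random.randn(256, 128).astype(np.float32)
shared, codebook, idx = kmeans_share_weights(w)
print("unique values after sharing:", np.unique(shared).size)  # == 16
```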

 2.1.3 Reduce computational complexity

Compared with the cheap addition operation, multiplication has much higher computational complexity. The convolution widely used in deep neural networks is essentially cross-correlation, which is convenient for measuring the similarity between input features and convolution filters, but it involves a large number of floating-point multiplications. Chen et al. [10] proposed adder networks (AdderNets) to replace these large-scale multiplications in deep neural networks, especially CNNs (Convolutional Neural Networks), thereby lowering computational complexity and cost. AdderNets use the L1-norm distance between the filter and the input features as the output response and adopt an adaptive learning-rate strategy, scaled by the gradient magnitude of each neuron, to strengthen training. Without any multiplication in the convolutional layers, AdderNets achieve 74.9% Top-1 and 91.7% Top-5 accuracy with ResNet-50 on ImageNet.
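The following is a minimal sketch of the AdderNet output response from [10]: the response at each location is the negative L1 distance between the filter and the input patch, computed here with `torch.nn.functional.unfold`. The function name is illustrative, and the paper's adaptive learning-rate training strategy is not included.

```python
import torch
import torch.nn.functional as F

def adder_conv2d(x, weight, stride=1, padding=1):
    """AdderNet-style layer [10]: the response is -sum|patch - filter| (an L1
    distance), so similarity is measured with additions and subtractions
    rather than multiplications. x: (N, C, H, W); weight: (C_out, C, kH, kW)."""
    n, _, h, w_in = x.shape
    c_out, _, kh, kw = weight.shape
    h_out = (h + 2 * padding - kh) // stride + 1
    w_out = (w_in + 2 * padding - kw) // stride + 1
    patches = F.unfold(x, (kh, kw), stride=stride, padding=padding)      # (N, C*kH*kW, L)
    filt = weight.view(c_out, -1)                                        # (C_out, C*kH*kW)
    dist = (patches.unsqueeze(1) - filt[None, :, :, None]).abs().sum(2)  # (N, C_out, L)
    return (-dist).view(n, c_out, h_out, w_out)

x = torch.randn(1, 3, 32, 32)
w = torch.randn(16, 3, 3, 3)
print(adder_conv2d(x, w).shape)  # torch.Size([1, 16, 32, 32])
```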

2.2    Parameter pruning

Missile-borne images are large in volume and complex in content, which places heavy demands on the target recognition and classification network: a powerful, robust network is needed, yet embedded deployment requires a lighter one. Pruning can remove many connections and computations from the original robust network while maintaining considerable accuracy. The pruning pattern is shown in Figure 4.

Figure 4 Pruning example

Pruning can be divided into unstructured and structured pruning. Unstructured pruning is fine-grained: any "redundant" individual neuron weight can be removed, giving a high compression rate; in essence it sparsifies the convolution kernel matrix. A schematic of pruning granularity is shown in Figure 5. However, the network structure after unstructured pruning is irregular, which makes further operations difficult, such as replacing convolution with shifts or multiplication with addition. The smallest unit of structured pruning is a group of parameters within a filter. By assigning an evaluation factor to each filter or feature map, or even deleting entire filters or channels, the network is "shrunk", which allows existing software and hardware to be used directly for effective acceleration. Without a suitable optimization algorithm, however, an optimal solution cannot be guaranteed.

Figure 5 Schematic diagram of pruning granularity
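As a generic illustration of the fine-grained pruning just described (not any specific cited method), the sketch below zeroes the fraction of weights with the smallest magnitudes and keeps a mask so the sparse pattern can be re-imposed during fine-tuning.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.7):
    """Unstructured pruning: zero the `sparsity` fraction of weights with the
    smallest absolute value. Reapplying the mask after each fine-tuning step
    keeps pruned connections at zero."""
    k = max(1, int(sparsity * weight.numel()))
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).float()
    return weight * mask, mask

w = torch.randn(128, 256)
pruned, mask = magnitude_prune(w, sparsity=0.7)
print(f"remaining connections: {int(mask.sum())} / {w.numel()}")
```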

 2.2.1 Unstructured pruning

Early methods were mostly unstructured, and unstructured pruning is fine-grained. Representative examples are Optimal Brain Damage, proposed by LeCun et al. [11], and the Optimal Brain Surgeon method of Hassibi et al. [12]; these methods use the Hessian matrix of the loss function to determine the importance of connections and so reduce their number. Neubeck et al. [13] proposed a layer-by-layer algorithm that computes the loss contribution of each layer's parameters, finds the parameter with the largest loss and prunes it independently based on its second derivative, and then retrains after pruning to restore performance.

Unstructured pruning often adds a regularization term to the loss function as a penalty to constrain certain parameters; the regularized weights are then used to identify non-essential parameters, which are cut to achieve compression. However, the L0 norm of the parameters is not differentiable and is therefore difficult to optimize directly. Louizos et al. [14] reparameterized the weights as non-negative quantities that can be set exactly to 0, turning this into a differentiable problem. Tartaglione et al. [15] quantified the influence of weight parameters on the output and designed regularization terms to shrink or remove weights with little influence. Li et al. [16] pruned convolution kernels with small L1 norms. He et al. [17] used an L2 regularization criterion to prune during training (i.e., soft pruning).

Lin et al. [18] defined the product of the BN (Batch Normalization) layer coefficient and the filter's F-norm as a factor for evaluating the importance of each connection. Luo et al. [19] proposed an entropy-based method to evaluate filter importance. Chen et al. [20] introduced hardware constraints (such as latency), deleted the weights with low norm values, and adjusted the parameters to maximize task accuracy. Yang et al. [21] took energy consumption as an optimization constraint, ranked the energy consumption of the convolutional neural network's components, and cut the filters with high energy consumption; compared with conventional convolutional networks, this greatly reduces energy consumption while keeping the accuracy loss within an acceptable range. He et al. [22] automatically pruned models using a given maximum hardware resource budget as the constraint.

 2.2.2 Structured pruning

The object of structured pruning changes from a single neuron to vector-level, channel-level, or even whole filter-level connections. Results show that, without loss of accuracy, structured pruning reaches a sparsity rate similar to unstructured pruning; at the same accuracy, structured pruning obtains a better compression ratio than unstructured pruning, which must also store indices; and its more regular sparsity makes the thinned network better suited to fast hardware deployment. Structured pruning therefore has clear advantages.

Structured pruning can be subdivided into channel-level and filter-level pruning. Channel pruning directly deletes one or more channels of a given layer in the network. Its disadvantage is a larger accuracy loss; its advantage is that it produces no sparse matrix, can be computed directly, and saves considerable space and time. He et al. [23] proposed using the sum of the absolute values of all weights in a channel as the channel's importance index and pruning unimportant channels; the drawback is that retraining takes time. Molchanov et al. [24] considered the effect of feature maps on the loss function directly and used a first-order Taylor expansion to approximately delete unimportant channels. Filter-level pruning cuts entire three-dimensional filters, which preserves the structural characteristics of the network and is equivalent to a narrower version of the original architecture.
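A minimal sketch of filter-level pruning in the spirit of [16] and [23] follows: each output filter is scored by the sum of its absolute weights (its L1 norm) and the weakest filters are dropped, producing a genuinely smaller layer rather than a sparse one. The helper name is hypothetical.

```python
import torch
import torch.nn as nn

def prune_filters_l1(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    """Structured pruning: rank each output filter by its L1 norm and rebuild
    the layer with only the strongest filters. In a full network, the next
    layer's input channels must be pruned to match."""
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # one score per filter
    n_keep = max(1, int(keep_ratio * conv.out_channels))
    keep = scores.topk(n_keep).indices.sort().values        # indices of kept filters
    new_conv = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    new_conv.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        new_conv.bias.data = conv.bias.data[keep].clone()
    return new_conv

conv = nn.Conv2d(64, 128, 3, padding=1)
print(prune_filters_l1(conv))  # Conv2d(64, 64, ...), a physically smaller layer
```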

It should be pointed out that, for structured pruning, the accuracy obtained by fine-tuning the pruned network is often no higher than that of training the target network from scratch. The structural design of the pruned target network is therefore actually the most important aspect and must be designed and selected carefully.

2.3     Knowledge Distillation

Knowledge distillation transfers "knowledge" from a complex model to a network model with a simpler structure and faster operation. Hinton et al. [25] first proposed the concept, taking the inter-class relationships learned by the teacher network as the learning target of the student network and thereby realizing "knowledge" transfer, as shown in Figure 6. The overall loss function has two parts: the hard loss, between the student network's predictions and the ground truth, and the soft loss, between the outputs of the teacher and student networks softened at the same "temperature". The total loss is the sum of the two. In this way the student network not only measures its gap from the ground truth but also learns the relationships between categories captured by the teacher network. The method suits small and medium-sized datasets: because the compressed student model inherits existing knowledge, it can achieve robust performance even when the dataset is small.

Figure 6 Knowledge distillation structure diagram
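A minimal sketch of the hard-loss/soft-loss combination from [25] follows: cross-entropy against the ground truth plus KL divergence between teacher and student outputs softened at the same temperature. The weighting `alpha` and temperature `T` are hypothetical choices.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Total loss = (1 - alpha) * hard_loss + alpha * soft_loss, as in [25].

    hard_loss: cross-entropy between student predictions and ground truth.
    soft_loss: KL divergence between teacher and student outputs at the same
    temperature T (scaled by T^2 to keep gradient magnitudes stable).
    """
    hard_loss = F.cross_entropy(student_logits, labels)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return (1 - alpha) * hard_loss + alpha * soft_loss

s = torch.randn(8, 10)             # student logits for a batch of 8
t = torch.randn(8, 10)             # teacher logits (teacher frozen in practice)
y = torch.randint(0, 10, (8,))
print(distillation_loss(s, t, y).item())
```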

 2.3.1 Improving Student Networks

Zagoruyko et al. [26] used a spatial attention mechanism to find high-attention regions in the teacher network, forcing the student network to imitate the powerful teacher's regions of interest and thereby significantly improving its performance. Li et al. [27] proposed a new attention-transfer method that passes the "knowledge" attended to by the teacher network and by the current student network on to the next student network in the sequence, as shown in Figure 7. Here the AT (attention) loss transfers the knowledge learned by the teacher network through the attention mechanism to the student network. Because the AT loss involves both the teacher and the student networks of a group, the student network of the previous group can influence the knowledge transferred to the next teacher network, that is, influence the AT loss; this is a process of mutual teaching and learning.

Figure 7 Knowledge distillation combined with attention mechanism

Tian et al. [28] used contrastive learning to propose an alternative objective by which the student is trained to capture more of the information in the teacher's representation of the data. Heo et al. [29] used activations of hidden neurons to bring the classification decision boundaries of the student and teacher networks as close as possible, improving the fineness of the student network's classification.

 2.3.2 Improving the teacher network

Improving the information transfer from the teacher network is another major direction. Xu [30] proposed a new self-supervised knowledge distillation model that uses self-supervision to mine more hidden information from the teacher network. You et al. [31] proposed averaging the softened outputs, i.e. the "knowledge", of multiple teacher networks, imposing constraints on the differences between examples, and applying these to the student network's intermediate layers. Ahn et al. [32] transferred a teacher network pre-trained on the same or a similar task to the student network.

2.4     Lightweight Network

The demand for lightweight networks in the missile-borne embedded environment is great, but their application there is still rare. Recently, Song Tainian et al. [33] combined spatial and channel attention modules with MobileNetV2 for the missile-borne environment, showing that lightweight networks have great application potential in this setting. The following summarizes the development of lightweight networks at different structural levels.

 2.4.1 Convolution Kernel Design

Zhang et al. [34] used grouped convolution in ShuffleNet V1 to greatly reduce the parameter count and proposed channel shuffle to allow communication between groups of channels, avoiding the isolation of information within groups. Ma et al. [35] proposed ShuffleNet V2 along with four design criteria for lightweight networks: (1) memory access cost (MAC) is minimized when the numbers of input and output channels are equal; (2) grouped convolution with too many groups increases MAC; (3) fragmented operations are unfriendly to parallel acceleration; (4) the memory and time consumed by element-wise operations cannot be ignored. Xception [36] is the final member of the Inception family, which evolved from GoogLeNet [37] (Inception V1). GoogLeNet splits the convolutional layer into parallel branches of different sizes, such as 3×3, 5×5, and 7×7, adds a fully connected layer, and uses 1×1 convolutions for dimensionality reduction, greatly improving fine-grained classification without adding too many parameters. Subsequently, Inception V2 and V3 introduced depth-wise and point-wise designs, factorized 5×5 and 7×7 into stacked 3×3 convolution kernels, and selected 1×3 and 3×1 kernels under different conditions, greatly reducing the parameter count; the results are very good and have a certain scalability. Xception contains only 1×1 and 3×3 modules: the 1×1 modules are shared, and each 3×3 convolution operates only on its own part of the preceding layer's channels, i.e. a depthwise separable convolution.
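The two building blocks this subsection describes can be sketched in a few lines, assuming PyTorch: a depthwise separable convolution (depth-wise then point-wise, the factorization used by Xception and MobileNet) and the channel shuffle of [34].

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A 3x3 depthwise conv (one filter per channel) followed by a 1x1
    pointwise conv: far fewer parameters than a dense 3x3 convolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def channel_shuffle(x, groups):
    """ShuffleNet's channel shuffle [34]: interleave channels across groups so
    information flows between the groups of a grouped convolution."""
    n, c, h, w = x.shape
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2).contiguous().view(n, c, h, w))

x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # (1, 64, 56, 56)
print(channel_shuffle(x, groups=4).shape)       # (1, 32, 56, 56)
```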

 2.4.2 Layer-level design

Huang et al. [38] proposed stochastic depth training for residual networks such as ResNet, randomly dropping subsets of blocks in each training pass and bypassing them through the residual connections. Dong et al. [39] equipped each convolutional layer with a lightweight collaborative layer, LCCL (Low-Cost Collaborative Layer), which predicts which outputs after the activation layer are zero and removes the corresponding computation at test time. Li et al. [40] divided network layers into weight layers and non-weight layers, counting the parameters of weight layers and ignoring those of non-weight layers. The authors of [41] designed sparse connections between filters. Wu et al. [42] replaced convolution with shifts of feature maps and feature vectors, reducing computation. Chen et al. [43] built networks with a sparse shift layer (SSL) in place of the original convolutional layer: in this architecture, the basic block consists only of 1×1 convolutional layers for raising and lowering dimensionality, with a sparse shift operation applied to the intermediate feature map.

 2.4.3 Network Structure Level Design

Kim et al. [44] proposed SplitNet, which automatically learns to split each network layer into several groups, forming a root node whose weights are shared by its child nodes. Gordon et al. [45] proposed cyclically optimizing the network through pruning and expansion: in the pruning stage, a regularization loss judges the weights and removes those with little influence; in the expansion stage, each layer grows according to the importance of its nodes, so layers with more important nodes end up wider and receive more resources. These methods have certain problems, however: they require large amounts of computing time, and the node-traversal procedure means the local optimum may fail to reach the global best result. Kim et al. [46] proposed a nested sparse network in which each layer consists of a multi-level network, with high-level and low-level networks sharing parameters: the low-level network has a small receptive field, learns coarsely, and captures only surface features, while the high-level network has a large receptive field, can see internal correlations, and learns fine knowledge.

 2.4.4 Neural Network Architecture Search

The high performance of EfficientDet [47] stems mainly from two components: a powerful feature extractor and BiFPN. The BiFPN work is unrelated to lightweighting, but the feature extractor, EfficientNet [48], uses a new design paradigm, neural architecture search (NAS) [49]. NAS is a design pattern that searches over network architectures; it has been widely applied to image recognition, image classification, and other fields with meaningful results.

The design idea of NAS is to use efficient search strategies within a specific search space to find neural networks well matched to practical applications. Many lightweight networks use NAS, for example NASNet [50] and MobileNetV3 [51]. The search space of EfficientNet mainly covers skip connections, convolutional layers, pooling layers, fully connected layers, and the depthwise convolutions common in lightweight networks; its evaluation strategy jointly scales input resolution, network depth, and network width, yielding a family of lightweight, high-performance convolutional networks.
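EfficientNet's joint scaling of depth, width, and resolution (compound scaling) can be stated in a few lines; the constants below are the values reported for EfficientNet-B0 in [48], and the base settings are illustrative.

```python
# Compound scaling from EfficientNet [48]: depth, width, and resolution are
# scaled together by one coefficient phi, under alpha * beta^2 * gamma^2 ~= 2.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # values reported for EfficientNet-B0 [48]

def compound_scale(phi: int, base_depth=1.0, base_width=1.0, base_res=224):
    depth = base_depth * ALPHA ** phi             # more layers
    width = base_width * BETA ** phi              # more channels per layer
    resolution = round(base_res * GAMMA ** phi)   # larger input images
    return depth, width, resolution

for phi in range(4):   # roughly the B0..B3 family
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, input {r}x{r}")
```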

In 2016, MIT proposed MetaQNN [52], which uses reinforcement learning to learn the type and parameters of each CNN layer, generates a network structure, and feeds the resulting loss back into training. MetaQNN models the architecture search as a Markov decision process and uses RL (Reinforcement Learning) to generate CNN architectures. Zoph [53] used an RNN (Recurrent Neural Network) as a controller to sample and generate descriptions of network structures, then trained the controller with the REINFORCE algorithm (an early RL method) to reach higher accuracy. Using 800 GPUs, the search eventually beat hand-designed models of similar architecture on the CIFAR-10 dataset and found a structure better than the LSTM (Long Short-Term Memory).

Real et al. [54] first brought the evolutionary algorithm into the NAS problem and demonstrated its high accuracy on the CIFAR-10 dataset. The network structure is treated as DNA (Deoxyribonucleic Acid) and encoded; through continuous mutation and evolution, the population of network models grows, each model's accuracy is evaluated after every round, and the worst are eliminated. Taking an existing model as a parent node, child nodes are formed by mutation (random selection among some predetermined network-structure modification operations); child nodes are trained, validated, and placed in the population. In [55], the oldest node in the population is removed so that evolution favors younger individuals, which helps exploration. The best network structure found this way is called AmoebaNet. Comparing the three classes of algorithms shows that reinforcement learning and evolutionary algorithms are the more accurate; evolutionary algorithms search faster than reinforcement learning (especially early on) and produce smaller models.

Gradient-based search algorithms include DARTS (Differentiable Architecture Search), proposed by researchers at Carnegie Mellon University (CMU) and Google [56]. DARTS represents the search space as a directed acyclic graph of ordered nodes, where nodes are latent representations (such as feature maps) and edges are the candidate operations connecting them. The key technique is to use the softmax function to mix the candidate operations on each edge, forming a continuous, differentiable relaxation of the discrete architecture choice. The optimal structure can then be found by gradient-based search, and finally the operation with the highest weight on each edge is selected to form the network. In addition, [57] proposed a method with higher search efficiency: the network structure is embedded into a continuous search space in which each point represents an individual network and a predictor of its accuracy can be defined; gradient-based search finds a good point in this embedding, which after optimization is decoded back into a network structure.
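A minimal sketch of the DARTS continuous relaxation [56] follows: a single edge computes a softmax-weighted mixture of candidate operations, so the architecture weights receive gradients like ordinary parameters. The three candidate operations here are illustrative stand-ins for the paper's operation set.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One DARTS edge [56]: a softmax-weighted sum of all candidate
    operations, making the architecture weights `alpha` differentiable.
    After search, only the highest-weight operation per edge is kept."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),  # candidate: 3x3 conv
            nn.MaxPool2d(3, stride=1, padding=1),         # candidate: pooling
            nn.Identity(),                                # candidate: skip connection
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

edge = MixedOp(16)
out = edge(torch.randn(2, 16, 32, 32))
out.sum().backward()
print("architecture gradient:", edge.alpha.grad)  # alpha is trainable by SGD
```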

3 Comparison of compression effects

To test how well the various algorithms serve missile-borne images, tests were run on a Xilinx Zynq UltraScale+ MPSoC ZCU104 (quad-core Arm Cortex-A53). Xilinx provides mature deep learning IP for networks such as LeNet, AlexNet, and ResNet, which makes it convenient to port networks processed with parameter quantization and sharing, parameter pruning, and knowledge distillation. Using the common ImageNet dataset [58] and a self-made ship target dataset, AlexNet and ResNet were tested for parameter reduction and accuracy change. Define Δaccuracy = Top-1 accuracy after compression - Top-1 accuracy before compression; #Param = parameter count after compression / parameter count before compression; #FLOPs = floating-point operations after compression / floating-point operations before compression. For knowledge distillation, "before compression" refers to the teacher network and "after compression" to the student network. The compression effects of the different methods on AlexNet on ImageNet are shown in Table 3.

Table 3 Compression effect of different compression methods on AlexNet on ImageNet
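The three metrics defined above reduce to simple ratios; a small helper, with entirely hypothetical example numbers, makes the convention explicit.

```python
def compression_metrics(top1_before, top1_after, params_before, params_after,
                        flops_before, flops_after):
    """Metrics as defined for Tables 3-4: accuracy change plus parameter and
    FLOP ratios (after / before); values below 1.0 mean the model shrank."""
    return {
        "delta_accuracy": top1_after - top1_before,  # percentage points
        "#Param": params_after / params_before,
        "#FLOPs": flops_after / flops_before,
    }

# Hypothetical example: a network compressed to 30% of its parameters and
# 45% of its FLOPs at a 1.2-point Top-1 cost.
print(compression_metrics(57.2, 56.0, 61e6, 18.3e6, 720e6, 324e6))
```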

Table 3 shows that the four classes of compression method differ greatly. The weight quantization of [4] has clear advantages: the compression of parameters and computation is considerable, and the accuracy drop stays within an acceptable range. The results of SplitNet [44], which is not even a top lightweight network, show that lightweight networks also compress strongly; their disadvantage is that new lightweight networks appear endlessly, and deploying each on the missile-borne platform takes a long time. On the other hand, this means more and better networks can be deployed on the platform for better and faster results, which is precisely the potential of lightweight networks. Parameter pruning and knowledge distillation do not perform spectacularly on their own, but their advantage is that they can be combined with other methods to obtain better performance.

For projectiles, armored targets are the more realistic case, so a self-made armored target dataset was built to compare compression effects. A UAV took 500 pictures of various armored targets at an altitude of 2500-3000 m, the altitude of the image-guidance phase of a terminal-guided projectile, with a shooting inclination of 30°-40°, matching the terminal-guidance pitch angle. By randomly enhancing the brightness of each picture by a factor of 1.1-1.5 and adding rotations of each picture at 90° intervals plus flips, the dataset was expanded to 3000 target images: 2400 training images, 200 validation images, and 400 test images, all at a resolution of 500×500, covering self-propelled artillery, tanks, and long-range rocket launchers. Of the original 500 pictures, self-propelled artillery accounted for 180, tanks 180, and long-range rocket launchers 140. AlexNet was used to test compression on this armored target dataset, with the results shown in Table 4.

Table 4 Compression effect of different compression methods on the missile-borne image set

Embedded languages now provide parameter types such as int8 and int16, which let researchers compute directly with quantized parameters during deployment; the results in Table 4 again reflect the superiority of parameter quantization. The lightweight network retains strong compression ability and high accuracy. From the randomness introduced by pruning, it can be inferred that pruning reduces the network's overfitting; from the task-specific focus of the student network, it can be inferred that knowledge distillation improves the model's specialized ability. For these reasons, parameter pruning and knowledge distillation actually improved accuracy here.

These tests show that lightweight networks have great potential, but the large differences in structure and orientation among them make deployment difficult. Each algorithm was therefore tested on ImageNet on a computer with an NVIDIA RTX 3080 graphics card and an AMD 5000-series CPU, providing direction for the next step of deploying lightweight networks on the missile-borne platform. The relationship between Top-1 accuracy, FLOPs (Floating Point Operations), and parameter count is shown in Figure 8.

Figure 8 Comprehensive comparison of commonly used lightweight networks

Figure 8 shows that ResNeXt-101 has high accuracy but consumes a huge amount of computation, with a parameter count of nearly 64M; it must be compressed before deployment. AmoebaNet offers higher accuracy with little computation and few parameters, which makes it very suitable for deployment, but since it was searched in a GPU space, whether it suits FPGAs and other missile-borne platforms requires further verification. ResNeXt-50 is a hand-designed network with a high accuracy rate and moderate computation, suitable for deployment after compression.

ResNeXt-50 was selected as the backbone network, with the CSPNet feature extraction network of YOLO v3 as the detection head; a validation run was made every 100 training rounds, and training targeted the characteristics of missile-borne targets to obtain tracking results. Figure 9 shows video results sampled every 5 frames.

Figure 9 Test results of target rotation, scale change, and entering and exiting the field of view

After training, the ResNeXt-50 backbone, with its parameter count greatly reduced, handles image rotation, scale changes, and targets entering and leaving the field of view.

To sum up, the potential of deep learning networks is huge: choosing an appropriate compression method can maintain high accuracy with a suitable parameter and computation budget, meeting missile-borne hardware requirements.

4 Outlook

To port the recognition algorithm to an embedded platform, the compressed network should be evaluated and analyzed from multiple angles. Combining the characteristics of the algorithms reviewed in this paper with the missile-borne platform's special requirements for algorithms, the following outlook is offered for implementing recognition algorithms on missile-borne platforms:

  • (1) Optimization modules to improve lightweight network performance. Many current lightweight networks use optimization modules to improve computing efficiency, such as the final Inception module in Xception; adding such a module can increase speed and reduce parameters. NAS offers a new way to find such modules: a NAS search tailored to the missile-borne embedded environment makes it easier to find suitable lightweight modules to insert into the target algorithm, providing a new approach for realizing the algorithm on the missile-borne platform.

  • (2) Reasonable use of DSP and MAC (Multiply-Accumulate) modules. Chips and edge computing devices contain many DSP and MAC modules, and the convolution module can be reworked into multiply-add form to use these resources more fully and effectively; FPGAs, for example, contain large numbers of parallel MAC modules. An algorithm improved in this way and deployed to the device can not only shine on missile-borne platforms but also run faster and perform better on ordinary smart mobile terminals.

  • (3) Direct compression. Algorithms that compress existing models, such as the commonly used pruning and knowledge distillation, can improve existing models and achieve lightweight networks with little or no loss of accuracy; combined with other methods, their robustness on the missile-borne platform can be ensured.

