Cutting-Edge Vision Technology: Four Directions for Low-Power Computer Vision, in Pursuit of Smaller, Faster, and More Efficient Models

Deep learning is widely used in computer vision tasks such as object detection and classification. However, these applications often demand enormous amounts of computation and energy. For example, to classify a single image, VGG-16 performs about 15 billion operations, and YOLOv3 performs about 39 billion.

This raises a question: how can deep learning be deployed on low-power embedded systems or mobile devices? One option is to offload computation to the cloud, but that does not fundamentally solve the problem, because many deep learning applications must run inference on the device itself, for example on unmanned aerial vehicles (which typically operate without network access) or on satellites.

Since 2016, industry and academia have been exploring model acceleration and miniaturization, producing a large number of compact-model techniques. These techniques can remove redundancy from DNNs, cutting computation by more than 75% and inference time by more than 50% while keeping accuracy essentially intact. But for DNN models to be deployed on end devices at scale, further optimization is still needed.

To move forward, it helps to understand the current state of progress in low-power computer vision. Abhinav Goel and colleagues at Purdue University recently published a survey of research progress in this area (focused on DNN inference rather than training) that is well worth reading.

Paper link: https://arxiv.org/pdf/2003.11066

In the survey, Goel et al. divide low-power inference techniques into four categories:

1) Parameter quantization and pruning: reduce the number of bits used to store DNN parameters, lowering both memory use and computational cost.

2) Convolution filter compression and matrix decomposition: decompose large DNN layers into smaller ones to reduce memory requirements and the number of redundant matrix operations.

3) Network architecture search: automatically build DNNs from combinations of layers of different types in order to find an architecture with the desired performance.

4) Knowledge transfer and distillation: train a compact DNN to mimic the outputs, features, and activations of a larger, more computationally expensive DNN.

The figure below summarizes these four methods along with their advantages and disadvantages:

In addition to summarizing the advantages and disadvantages of these methods, Goel et al. also suggest several possible improvements and propose a set of evaluation metrics to guide future research.

1. Parameter Quantization and Pruning

Memory accesses have a major impact on the energy consumption of DNNs. One strategy for building low-power DNNs is therefore to trade off performance against the number of memory accesses. There are two main ways to do this: parameter quantization, which reduces the size of DNN parameters, and pruning, which deletes unimportant parameters and connections from the DNN.

1.1 Parameter quantization

Research by Courbariaux et al. shows that training with parameters stored in fixed-point formats of various bit widths slightly increases test error as the bit width decreases (the change in error is almost negligible) but significantly reduces energy consumption, as shown in the figure below:

Building on this property, a good deal of work (e.g. LightNN, CompactNet, FLightNN) tries to find the best bit width for DNN parameters under a given accuracy constraint. Courbariaux, Rastegari, and others went as far as proposing binary neural networks.
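To make the idea concrete, here is a minimal PyTorch sketch (not code from the survey) of uniform fake-quantization of a model's weights to a given bit width; the helper name quantize_uniform and the 8-bit setting are illustrative assumptions.

```python
import torch

def quantize_uniform(weights: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Uniformly quantize a weight tensor to num_bits and dequantize it back.

    This simulates storing parameters at reduced precision ("fake quantization");
    a real deployment would keep the integer codes plus the scale and zero point.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / (qmax - qmin)
    if scale == 0:  # constant tensor: nothing to quantize
        return weights.clone()
    zero_point = qmin - torch.round(w_min / scale)
    q = torch.clamp(torch.round(weights / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale

# Example: quantize every weight tensor of a small model to 8 bits.
model = torch.nn.Linear(128, 10)
with torch.no_grad():
    for p in model.parameters():
        p.copy_(quantize_uniform(p, num_bits=8))
```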

To further reduce DNN memory requirements, parameter quantization is now often combined with model compression. Han et al., for example, first quantize the parameters into discrete bins and then compress the bins with Huffman coding, reducing model size by 89% with essentially no loss of accuracy. Similarly, HashedNet quantizes DNN connections into hash buckets, so that all connections mapped to the same bucket share a single parameter. However, these methods have high training costs, which limits their use.

Advantages: as the parameter bit width decreases, DNN performance remains essentially unchanged, largely because constraining the parameters has a regularizing effect during training.

Disadvantages and directions for improvement: 1) quantized DNNs often need to be retrained several times, making training very energy-intensive, so reducing the training cost of these techniques must be considered; 2) different layers of a DNN have different sensitivities, and using the same bit width for every layer hurts performance, so choosing a different precision for each layer and connection is a key step toward better performance, and these precisions can be learned during training.

1.2 Pruning

Deleting unimportant parameters and connections from a DNN reduces the number of memory accesses.

A Hessian-weighted distortion measure can evaluate the importance of each DNN parameter, so that redundant parameters can be removed and the model shrunk; however, pruning methods based on this measure apply only to fully connected layers.

To extend pruning to convolutional layers, many researchers have proposed alternatives. Anwar et al. proposed a particle-filter-based method; Polyak et al. sample the input data and prune connections whose activations are sparse; Han et al. use a new loss function to jointly learn the DNN's parameters and connections; Yu et al. use an importance score propagation algorithm to measure each parameter's importance with respect to the output.

Others have applied pruning, quantization, and compression to a model simultaneously, reducing model size by 95%.
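As a simplified stand-in for the pruning criteria cited above (this sketch uses plain magnitude-based pruning rather than the Hessian-weighted or importance-propagation measures), the following snippet zeroes out the smallest-magnitude weights of a layer:

```python
import torch

def magnitude_prune(weights: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    k = int(weights.numel() * sparsity)
    if k == 0:
        return weights
    threshold = weights.abs().flatten().kthvalue(k).values
    mask = (weights.abs() > threshold).to(weights.dtype)
    return weights * mask

# Example: prune 80% of the weights of a linear layer in place.
layer = torch.nn.Linear(256, 256)
with torch.no_grad():
    layer.weight.copy_(magnitude_prune(layer.weight, sparsity=0.8))
print(f"Remaining nonzero weights: {int(layer.weight.count_nonzero())}")
```

Note that, as the survey points out, the resulting sparse matrix only saves energy if the hardware or data structures can exploit it.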

Figure: compression rates of different DNN models, where P = pruning, Q = quantization, C = compression.

Advantages: as the figure above shows, combining pruning with quantization and coding yields even larger gains. For example, when all three are used together, VGG-16 can be reduced to 2% of its original size. Pruning also reduces the complexity of the DNN model and therefore the risk of overfitting.

Disadvantages and directions for improvement: again, pruning increases training time. As the table shows, using pruning and quantization together increases training time by 600%, and imposing sparsity constraints during pruning makes this problem even worse. Moreover, the benefits of pruning only materialize when custom hardware or special sparse-matrix data structures are used. Compared with current connection-level pruning, channel pruning may be a promising direction for improvement, since it requires no special data structures and does not produce sparse matrices.

2. Matrix Decomposition and Convolution Filter Compression

Convolution operations account for a large share of the computation in DNNs, while fully connected layers account for most of the parameters; in AlexNet, for example, the fully connected layers hold nearly 89% of the parameters. To reduce the power consumption of DNNs, both the computation of the convolutional layers and the parameters of the fully connected layers should therefore be reduced. There are two directions for this: 1) use smaller convolution filters; 2) decompose large parameter matrices into smaller matrices.

2.1 Convolution filter compression

Compared with larger filters, smaller convolution filters have fewer parameters and a lower computational cost.

However, replacing all of the large convolutional layers would hurt the DNN's translation invariance and thus its accuracy. Some researchers therefore try to identify redundant filters and replace only those with smaller ones. SqueezeNet is one such technique: it uses three strategies to convert 3 × 3 convolutions into 1 × 1 convolutions.

As the figure above shows, compared with AlexNet, SqueezeNet has 98% fewer parameters (at the cost of a slight increase in the number of operations) with no loss of performance.

MobileNets use depthwise separable convolutions and bottleneck layers to reduce computation, parameters, and latency. With depthwise separable convolutions, features are kept small and are only expanded into a larger feature space when needed, which preserves high accuracy.
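The construction behind these savings is a per-channel (depthwise) convolution followed by a 1 × 1 pointwise convolution. The PyTorch sketch below builds such a block; the layer sizes and the BatchNorm/ReLU arrangement are illustrative choices, not taken from the survey.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A depthwise separable convolution block: a per-channel 3x3 conv
    followed by a 1x1 pointwise conv that mixes channels."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

# A standard 3x3 conv from 64 to 128 channels uses 3*3*64*128 = 73,728 weights;
# the separable version uses 3*3*64 + 64*128 = 8,768, roughly 8x fewer.
block = DepthwiseSeparableConv(64, 128)
out = block(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 128, 32, 32])
```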

Advantages: bottleneck convolution filters greatly reduce the memory requirements and latency of DNNs, and for most computer vision tasks these methods achieve SOTA performance. Filter compression is orthogonal to pruning and quantization, so the three techniques can be combined to reduce power consumption further.

Disadvantages and directions for improvement: 1 × 1 convolutions have been shown to add significant overhead in small DNNs and to yield poor accuracy, mainly because their arithmetic intensity is too low to use the hardware effectively. Better memory management can raise the arithmetic intensity of depthwise separable convolutions, and optimizing the spatial and temporal locality of cached parameters can reduce the number of memory accesses.

2.2 Matrix decomposition

By expressing tensors or matrices in sum-product form, multi-dimensional tensors can be decomposed into smaller matrices, eliminating redundant computation. Some factorization methods can accelerate DNN models by more than 4x, because they decompose the parameter matrices into denser matrices and avoid the locality problems of unstructured sparse multiplication.

To minimize the loss of accuracy, matrix decomposition can be performed layer by layer: the parameters of the first layer are factorized, and subsequent layers are then factorized based on the reconstruction error. However, this layer-by-layer optimization makes it hard to apply such methods to large DNN models, because the number of decomposition hyperparameters grows exponentially with depth. Wen et al. reduce the number of factorization hyperparameters by using compact kernel shapes and depthwise structure.

There are many matrix factorization techniques. Kolda et al. showed that most factorization techniques can be used to accelerate DNN models, but they do not all achieve the best balance between accuracy and computational complexity. For example, CPD (canonical polyadic decomposition) and BMD (batch normalization decomposition) achieve good accuracy, while Tucker-2 decomposition and singular value decomposition do not. CPD compresses much better than BMD, but the optimization problem associated with CPD is sometimes unsolvable, in which case no decomposition can be found, whereas a BMD factorization almost always exists.
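As a minimal illustration of the low-rank idea (using a truncated SVD, one of the decompositions mentioned above, applied to a fully connected layer; the helper name factorize_linear and the chosen rank are assumptions for the example):

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a Linear layer with two smaller ones via truncated SVD.

    W (out x in) is approximated as (U * S)[:, :rank] @ Vt[:rank, :],
    turning out*in weights into roughly rank*(in + out) weights.
    """
    W = layer.weight.data                      # shape: (out_features, in_features)
    U, S, Vt = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]               # (out, rank)
    V_r = Vt[:rank, :]                         # (rank, in)
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = V_r
    second.weight.data = U_r
    if layer.bias is not None:
        second.bias.data = layer.bias.data.clone()
    return nn.Sequential(first, second)

# Example: a 4096x4096 layer (~16.8M weights) -> rank 256 (~2.1M weights).
fc = nn.Linear(4096, 4096)
fc_low_rank = factorize_linear(fc, rank=256)
x = torch.randn(8, 4096)
err = (fc(x) - fc_low_rank(x)).abs().mean()
print(f"Mean absolute approximation error: {err:.4f}")
```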

Advantages: matrix decomposition reduces the computational cost of a DNN, and the same factorizations can be applied to both convolutional and fully connected layers.

Disadvantages and directions for improvement: there is a lack of theoretical explanation for why some decompositions (e.g. CPD, BMD) obtain high accuracy while others do not; moreover, the computational overhead of performing the decomposition is often comparable to the gains it provides, offsetting the benefit. Matrix decomposition is also hard to apply to large DNN models, because the number of decomposition hyperparameters grows exponentially with depth and most of the training time is spent searching for the right ones. In practice the hyperparameters need not be searched over the whole space, so learning to narrow the search space during training could speed up the training of large DNN models.

3. Network Architecture Search

When designing low-power computer vision systems, different tasks may call for different DNN architectures. Because the space of possible architectures is so large, designing an optimal DNN by hand is often difficult. The best approach is to automate the process, i.e. network architecture search (NAS).

NAS uses a recurrent neural network (RNN) as a controller and reinforcement learning to construct candidate DNN architectures. Each candidate is trained and evaluated on a validation set, and the validation result serves as the reward used to optimize the controller's next candidate architectures.

NASNet and AmoebaNet demonstrate the effectiveness of NAS: architectures found by search achieve SOTA performance.

To obtain DNN models that are efficient on mobile devices, Tan et al. proposed MNasNet, which uses a multi-objective reward function in the controller. In their experiments, MNasNet is 2.3x faster than NASNet, with 4.8x fewer parameters and 10x fewer operations, and it is also more accurate than NASNet.

However, despite its strong results, most NAS algorithms are extremely compute-hungry. For example, MNasNet needs about 50,000 GPU hours to find an efficient DNN architecture on the ImageNet dataset.

To reduce the computational cost of NAS, a number of researchers have suggested searching for candidate architectures with proxy tasks and proxy rewards, for example using the smaller CIFAR-10 dataset instead of ImageNet. FBNet takes this approach and is 420x faster than MNasNet. But Cai et al. showed that an architecture optimized on a proxy task is not guaranteed to be optimal on the target task. To overcome this limitation of proxy-based NAS, they proposed Proxyless-NAS, which uses path-level pruning to reduce the number of candidate architectures and a gradient-based method to handle latency; it finds an efficient architecture in 300 GPU hours. Another method, Single-Path NAS, compresses the architecture search time to 4 GPU hours, but this speedup comes at the cost of reduced accuracy.
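As a toy sketch of the search loop (using random sampling in place of an RNN controller, and a latency-and-size score in place of a trained-accuracy reward, so it is only a schematic of the NAS idea; the search space and helper names are hypothetical):

```python
import random
import time
import torch
import torch.nn as nn

# Hypothetical toy search space: number of blocks, channel width, kernel size.
SEARCH_SPACE = {"depth": [2, 4, 6], "width": [16, 32, 64], "kernel": [3, 5]}

def build_candidate(depth: int, width: int, kernel: int) -> nn.Module:
    layers, in_ch = [], 3
    for _ in range(depth):
        layers += [nn.Conv2d(in_ch, width, kernel, padding=kernel // 2),
                   nn.ReLU(inplace=True)]
        in_ch = width
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(width, 10)]
    return nn.Sequential(*layers)

def score(model: nn.Module, latency_target_ms: float = 5.0) -> float:
    """Stand-in multi-objective reward combining measured CPU latency and
    parameter count. Real NAS would also train each candidate and include
    validation accuracy in the reward."""
    x = torch.randn(1, 3, 32, 32)
    start = time.perf_counter()
    with torch.no_grad():
        model(x)
    latency_ms = (time.perf_counter() - start) * 1000
    params = sum(p.numel() for p in model.parameters())
    return -(latency_ms / latency_target_ms) - params / 1e6

best_cfg, best_score = None, float("-inf")
for _ in range(10):  # random search instead of an RNN controller, for brevity
    cfg = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
    s = score(build_candidate(**cfg))
    if s > best_score:
        best_cfg, best_score = cfg, s
print("Best configuration found:", best_cfg)
```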

Advantages: by searching the space of possible architectures without manual intervention, NAS automatically balances the trade-offs among accuracy, memory, and latency, and it achieves the best accuracy and energy consumption on many mobile devices.

Disadvantages and directions for improvement: the computational cost is too high, making it difficult to search for architectures on tasks with large datasets. In addition, to find an architecture that meets the performance requirements, each candidate must be trained and run on the target device to generate its reward, which is very expensive. In practice, candidate DNNs could be trained in parallel on different subsets of the training data to reduce training time, with the gradients from the different subsets combined into one trained DNN. However, such parallel training may lower accuracy; using adaptive learning rates can recover accuracy while maintaining a high convergence rate.

4. Knowledge Transfer and Distillation

Large models are more accurate than small ones because their greater number of parameters lets them learn more complex functions. Can a small model learn such a complex function as well?

One approach is knowledge transfer, which migrates the knowledge of a large DNN into a small one. To learn a complex function, the small DNN is trained on data labeled by the large DNN. The idea is that the labels produced by the large DNN contain a great deal of information useful to the small DNN. For example, if the large DNN assigns high probabilities to several classes for an input image, those classes may share common visual features; a small DNN that learns to mimic these probabilities can learn more than it would from the raw data alone.

Another technique is knowledge distillation, proposed by Hinton in 2014, whose training procedure is much simpler than knowledge transfer. Knowledge distillation trains the small DNN with a student-teacher paradigm: the small DNN is the student and a group of specialized DNNs are the teachers; by training the student to imitate the teachers' outputs, the small DNN learns to perform the overall task. In Hinton's original work, however, the small DNN's accuracy dropped somewhat. Li et al. further improved the accuracy of the small DNN by minimizing the Euclidean distance between the teacher's and student's feature vectors. Similarly, FitNet has the student mimic the teacher's feature maps at every layer. But both methods impose restrictive assumptions on the student's structure and generalize poorly. To address this, Peng et al. use correlation-based metrics as the optimization objective.
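Since Hinton-style distillation is described above, the sketch below shows the standard distillation loss it introduced, with temperature-softened teacher probabilities; the temperature, weighting alpha, and toy tensors are illustrative choices, not values from the survey.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 4.0,
                      alpha: float = 0.7) -> torch.Tensor:
    """Hinton-style distillation loss: a weighted sum of (1) KL divergence
    between temperature-softened teacher and student distributions and
    (2) ordinary cross-entropy against the ground-truth labels."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage: 8 samples, 10 classes; in practice the teacher runs in eval mode
# under torch.no_grad() and only the student is updated by this loss.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(float(loss))
```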

Advantages: techniques based on knowledge transfer and knowledge distillation can significantly reduce the computational cost of large pre-trained models. Studies have shown that knowledge distillation applies beyond computer vision, for example to semi-supervised learning, domain adaptation, and other tasks.

Disadvantages and directions for improvement: knowledge distillation usually makes strict assumptions about the structure and size of the student and teacher, so it is hard to extend to all applications. In addition, current distillation techniques rely heavily on the softmax output and do not work with other output layers. As a direction for improvement, the student could learn the sequence of neuron activations in the teacher model rather than merely imitating the teacher's neuron or layer outputs; this would remove the restrictions on student and teacher structure (improving generalization) and reduce the dependence on the softmax output layer.

5. Discussion

In fact, no single technique builds the most efficient DNN model; most of the techniques above are complementary and can be used together to reduce energy consumption, shrink the model, and improve accuracy. Based on the analysis above, the authors draw five conclusions at the end of the paper:

1) Quantizing parameters to lower precision can significantly reduce model size and the complexity of arithmetic operations; quantization is difficult to implement manually in most machine learning libraries, but NVIDIA's TensorRT library provides an interface for this optimization.

2) When optimizing a large pre-trained DNN, pruning and model compression are effective choices.

3) When training a new DNN model from scratch, compressed convolution filters and matrix decomposition should be used to reduce the model's size and computation.

4) NAS can be used to find the optimal DNN model for a specific device. DNNs with many branches (e.g. those found by Proxyless-NAS or MNasNet) often require expensive kernel launches and GPU/CPU synchronization.

5) Knowledge distillation is well suited to small and medium-sized datasets, and approaches that make fewer assumptions about the student and teacher DNN architectures achieve higher accuracy.

 

Reproduced from: https://mbd.baidu.com/newspage/data/landingsuper?context=%7B%22nid%22%3A%22news_10586499957011354835%22%7D&n_type=0&p_from=1
