Summary of Lightweight Model Design and Deployment

Foreword

The core of lightweight network design is to slim the network down in terms of size and speed while preserving accuracy as much as possible. For manually designing lightweight networks there are no universally accepted guidelines, only general guiding principles and design summaries for particular chip platforms (different chip architectures). It is recommended to draw those guiding principles and suggestions from the classic papers, and then actually deploy and benchmark the model on each target hardware platform.

Definition and understanding of some keywords

Computation (FLOPs)

  • FLOPs: floating point operations, i.e. the number of floating-point operations, understood as the amount of computation; it can be used to measure the time complexity of an algorithm/model.

  • FLOPS (all uppercase): Floating-point Operations Per Second, the number of floating-point operations performed per second, understood as computation speed; it is a metric for hardware performance/model speed, i.e. the computing power of a chip.

  • MACCs: multiply-accumulate operations, the number of multiply-add operations. MACCs is roughly half of FLOPs: think of w[0]∗x[0]+... as one multiply-accumulate, i.e. 1 MACC.

Memory access cost MAC

MAC (Memory Access Cost): the total amount of memory traffic generated when a single sample (one image) is fed into the model/convolution layer and completes one forward pass, i.e. the space complexity of the model, measured in Bytes.

For the calculation method of FLOPs and MAC, please refer to the note on neural network model complexity analysis: https://github.com/HarleysZhang/cv_note/blob/79740428b6162630eb80ed3d39052cac52f60c32/9-model_deploy/B-%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C%E6%A8%A1%E5%9E%8B%E5%A4%8D%E6%9D%82%E5%BA%A6%E5%88%86%E6%9E%90.md.
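As a concrete illustration of how these quantities relate for a single layer, the minimal sketch below estimates the FLOPs, MACCs and MAC of one standard convolution layer from its shape parameters. It is not taken from the linked note; batch size 1, float32 storage, and ignoring bias and padding are all simplifying assumptions.

```python
def conv2d_cost(c_in, c_out, k, h_out, w_out, h_in=None, w_in=None, bytes_per_elem=4):
    """Rough cost model for one standard k x k convolution (batch size 1, bias ignored)."""
    h_in = h_in if h_in is not None else h_out
    w_in = w_in if w_in is not None else w_out
    maccs = k * k * c_in * h_out * w_out * c_out      # multiply-accumulate operations
    flops = 2 * maccs                                 # 1 MACC is counted as 2 FLOPs
    # memory traffic: read input feature map + read weights + write output feature map
    mac_bytes = (c_in * h_in * w_in
                 + k * k * c_in * c_out
                 + c_out * h_out * w_out) * bytes_per_elem
    return flops, maccs, mac_bytes

# example: a 3x3 convolution from 64 to 128 channels on a 56x56 feature map
print(conv2d_cost(64, 128, 3, 56, 56))
```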

GPU memory bandwidth

  • The memory bandwidth of a GPU determines how quickly it can move data from memory (vRAM) to the compute cores, and it is a more representative metric than memory speed.

  • The value of the GPU memory bandwidth depends on the data transfer speed between the memory and the compute cores, as well as the number of separate parallel links on the bus between these two parts.

The NVIDIA RTX A4000 is built on top of the NVIDIA Ampere architecture; its chip specifications are as follows:

NVIDIA RTX A4000 Chip Specifications

The A4000 chip is equipped with 16 GB of GDDR6 video memory and a 256-bit memory interface (the number of independent links on the bus between the GPU and the VRAM); thanks to these memory-related characteristics, the A4000 achieves a memory bandwidth of 448 GB/s.
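As a quick sanity check of that figure (assuming the commonly quoted effective GDDR6 data rate of 14 Gbps per pin for this card), the bandwidth follows directly from the bus width:

```python
data_rate_gbps_per_pin = 14      # assumed effective GDDR6 data rate
bus_width_bits = 256             # A4000 memory interface width
bandwidth_gb_per_s = data_rate_gbps_per_pin * bus_width_bits / 8
print(bandwidth_gb_per_s)        # 448.0, matching the quoted 448 GB/s
```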

Latency and Throughput

Refer to the slides An Introduction to Modern GPU Architecture by Ashu Rege, NVIDIA Director of Developer Technology: https://download.nvidia.com/developer/cuda/seminar/TDCI_Arch.pdf.

A general explanation of latency (Latency) and throughput (Throughput) in the field of deep learning:

  • Latency: both humans and machines need reaction time to make a decision or take an action. Latency is the time elapsed between making a request and receiving the response. In most human-facing software systems (not just AI systems), latency is measured in milliseconds.

  • Throughput: how many inference results a created or deployed deep learning network can deliver per unit of time. A simple way to understand it is the maximum number of input samples the network can process within a unit of time (e.g. one second).

The CPU is a low-latency, low-throughput processor; the GPU is a high-latency, high-throughput processor.
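A minimal sketch of how the two metrics are typically measured in practice is given below; PyTorch is assumed, `model` and `batch` are placeholders, and the warm-up and iteration counts are arbitrary. Latency is the time per forward pass, while throughput is the number of samples processed per second, usually maximized by using a larger batch size.

```python
import time
import torch

def measure(model, batch, n_iters=100):
    """Return (latency in ms per forward pass, throughput in samples/s)."""
    model.eval()
    with torch.no_grad():
        for _ in range(10):                    # warm-up iterations
            model(batch)
        if batch.is_cuda:
            torch.cuda.synchronize()           # wait for queued GPU work before timing
        start = time.perf_counter()
        for _ in range(n_iters):
            model(batch)
        if batch.is_cuda:
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    latency_ms = elapsed / n_iters * 1000
    throughput = n_iters * batch.shape[0] / elapsed
    return latency_ms, throughput
```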

Volatile GPU Util

Many people run the nvidia-smi command and read the Volatile GPU-Util field to obtain the GPU utilization rate, but two misunderstandings about this utilization (GPU Util) are common:

  • Myth 1: GPU utilization = the proportion of compute units inside the GPU that are working; the higher the utilization, the more fully the compute power must be exploited.

  • Myth 2: Under otherwise identical conditions, the higher the utilization, the shorter the run time must be.

In fact, GPU Util essentially only reflects the percentage of time during which one or more kernels were executing on the GPU over the sampling period, and the sampling period is between 1/6 s and 1 s.

The original wording is: "Percent of time over the past sample period during which one or more kernels was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product." Source: the nvidia-smi documentation (https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf).

In layman's terms, it is the ratio of kernel execution time to total time within a period. For example, if GPU Util is 69% and the sampling period is 1 s, then during the past 1 s kernels were running on the GPU for 0.69 s. If GPU Util is 0%, the GPU is not being used and is idle.

That is to say, it does not tell us how many SMs are being used for computation, how "busy" the program really is, or what the memory usage looks like; in short, it cannot reflect how fully the compute power is being exploited.
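To watch this value over time (for example while a training job is running), one can poll nvidia-smi from a script. A minimal sketch, assuming nvidia-smi is on the PATH and supports the standard --query-gpu options:

```python
import subprocess
import time

def gpu_util_percent():
    """Sample Volatile GPU-Util (one value per GPU) via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(x) for x in out.strip().splitlines()]

for _ in range(5):
    print(gpu_util_percent())   # e.g. [69] means kernels were active ~69% of the sample period
    time.sleep(1)
```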

For the essence of GPU Util, refer to the Zhihu article "Teach you how to keep squeezing GPU computing power" (https://zhuanlan.zhihu.com/p/346389176) and the Stack Overflow Q&A (https://stackoverflow.com/questions/40937894/nvidia-smi-volatile-gpu-utilization-explanation).

NVIDIA GPU Architecture

In a GPU, more transistors are devoted to data processing than to data caching and flow control, so GPUs are well suited to highly parallel computations. At the same time, a GPU provides higher instruction throughput and memory bandwidth than a CPU.

An intuitive comparison of CPU and GPU is shown below.

distribution of chip resources for a CPU versus a GPU

Image source: CUDA C++ Programming Guide (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)

Finally, a brief summary of some characteristics of NVIDIA GPU architectures:

  • SIMT (Single Instruction Multiple Threads) execution model: multiple cores can only execute the same instruction at the same time. Although this looks somewhat similar to SIMD in modern CPUs, it is fundamentally different.

  • Better suited to compute-intensive and data-parallel programs, due to the lack of large Cache and complex Control logic.

The evolution of NVIDIA GPU architectures from 2008 to 2020 is shown in the figure below:

2008-2020 NVIDIA GPU Architecture Evolution History

In addition, for a historical overview of the evolution of NVIDIA GPU architectures from 2010 to 2020, refer to the Zhihu article "NVIDIA GPU architecture evolution over nearly a decade, from Fermi to Ampere" (https://zhuanlan.zhihu.com/p/413145211).

For an in-depth understanding of GPU architecture, refer to the cnblogs article "In-depth GPU hardware architecture and operating mechanism" (https://www.cnblogs.com/timlly/p/11471507.html#41-gpu%E6%B8%B2%E6%9F%93%E6%80%BB%E8%A7%88).

Understanding of CNN Architecture

To a certain extent, the deeper and wider the network, the better its performance. Width refers to the number of channels, and depth refers to the number of layers; resnet18, for example, is an 18-layer network. Note that the width discussed here has nothing to do with models such as broad learning; it specifically refers to the (channel) width of deep convolutional neural networks.

  • The significance of network depth: the layers of a CNN abstract the input image data level by level. For example, the first layer learns edge features, the second layer learns simple shape features, and the third layer learns object shape features; increasing the network depth improves the model's abstraction ability.

  • The significance of network width: the width (number of channels) of a layer is the number of its (3D) filters. The more filters there are, the stronger the ability to extract target features, which means each layer of the network can learn richer features, such as textures of different orientations and frequencies.
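In code terms (PyTorch assumed, with arbitrary layer sizes), width and depth map directly onto layer hyperparameters:

```python
import torch.nn as nn

# width = out_channels of each conv; depth = how many layers are stacked
block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),   # width 32: 32 filters over the 3 input channels
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # width 64: each layer can learn richer features
    nn.ReLU(inplace=True),
)  # depth here is 2 conv layers; resnet18 stacks 18 weighted layers
```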

Suggestions for Manually Designing Efficient CNN Architectures

some conclusions

  1. The inference performance of a model must be analyzed in combination with the specific inference platform (common ones include NVIDIA GPUs, mobile ARM CPUs, edge-side NPU chips, etc.). Currently known factors affecting CNN model inference performance include: operator computation FLOPs (parameter count Params), the memory access cost of convolution blocks (memory access bandwidth), network parallelism, etc. However, only under the same hardware platform and the same network architecture is the FLOPs speed-up ratio directly proportional to the inference-time speed-up ratio.

  2. It is recommended that lightweight network design consider direct metrics (e.g. speed) rather than indirect metrics (e.g. FLOPs).

  3. Low FLOPs does not equal low latency; this is especially untrue on hardware with acceleration capabilities (GPU, DSP and TPU), and it must be analyzed in detail against the specific hardware architecture.

  4. Even if CNN models with different network architectures have the same FLOPs, their MAC may differ greatly.

  5. Most of the time, on GPU chips, the depthwise convolution operator is actually a low-FLOPs operation with a large amount of data reads and writes. Since the compute bottleneck of GPU chips is usually the memory access bandwidth, the model wastes a lot of time reading and writing data from video memory, and as a result the GPU's compute power is not fully utilized, as illustrated by the sketch after this list. Source: the Zhihu article "FLOPs and model inference speed": https://zhuanlan.zhihu.com/p/122943688
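To make conclusions 4 and 5 concrete, the sketch below (a rough cost model with illustrative shapes, like the FLOPs/MAC example earlier) compares a standard 3×3 convolution with its depthwise-separable counterpart: FLOPs drop by roughly 8×, but the bytes moved stay about the same or even grow because the intermediate depthwise output has to be written and read back. The compute-to-memory-access ratio therefore collapses, and a bandwidth-bound GPU cannot turn the FLOPs saving into a proportional latency saving.

```python
def costs(c_in, c_out, k, hw):
    """FLOPs (2 * MACCs) and float32 bytes moved for one conv, batch 1, same spatial size."""
    flops = 2 * k * k * c_in * c_out * hw * hw
    bytes_moved = 4 * (c_in * hw * hw + c_out * hw * hw + k * k * c_in * c_out)
    return flops, bytes_moved

c_in, c_out, k, hw = 128, 128, 3, 56
std = costs(c_in, c_out, k, hw)

# depthwise separable = 3x3 depthwise conv (one filter per channel) + 1x1 pointwise conv
dw_flops = 2 * k * k * c_in * hw * hw
dw_bytes = 4 * (c_in * hw * hw + c_in * hw * hw + k * k * c_in)
pw = costs(c_in, c_out, 1, hw)
sep = (dw_flops + pw[0], dw_bytes + pw[1])

print("standard : FLOPs %.2e, bytes %.2e, FLOPs/byte %.0f" % (std[0], std[1], std[0] / std[1]))
print("separable: FLOPs %.2e, bytes %.2e, FLOPs/byte %.0f" % (sep[0], sep[1], sep[0] / sep[1]))
```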

some advice

  1. On most hardware, channel counts that are multiples of 16 are advantageous for efficient computation. For example, on HiSilicon 351x-series chips, when the number of input channels is a multiple of 4 and the number of output channels is a multiple of 16, the time speed-up ratio is approximately equal to the FLOPs speed-up ratio, which helps improve the compute utilization of the NNIE hardware.

  2. For layers with a low channel count (such as the first few layers of the network), it is usually more efficient to use normal convolution than separable convolution on accelerated hardware (NPU chips). (Source: MobileDets paper (https://medium.com/ai-blog-tw/mobiledets-flops%E4%B8%8D%E7%AD%89%E6%96%BClatency-%E8%80%83%E9%87%8F%E4%B8%8D%E5%90%8C%E7%A1%AC%E9%AB%94%E7%9A%84%E9%AB%98%E6%95%88%E6%9E%B6%E6%A7%8B-5bfc27d4c2c8))

  3. The shufflenetv2 paper (https://arxiv.org/pdf/1807.11164.pdf) proposes four practical guidelines for efficient network design: G1, equal input/output channel widths minimize MAC; G2, excessive group convolution increases MAC; G3, network fragmentation reduces parallelism; G4, element-wise operations are not negligible. (A short derivation of G1 is given after this list.)

  4. On GPU chips, 3×3 convolution is very fast, and its computational density (theoretical operations divided by the time used) can reach four times that of 1×1 and 5×5 convolutions. (Source: RepVGG paper (https://zhuanlan.zhihu.com/p/344324470))

  5. Improve model inference efficiency by addressing the problem of redundant gradient information, as in the CSPNet (https://arxiv.org/pdf/1911.11929.pdf) network.

  6. Address the high memory access cost and energy consumption caused by DenseNet's dense connections, as in the VoVNet (https://arxiv.org/pdf/1904.09730.pdf) network, which is built from OSA (One-Shot Aggregation) modules.
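For reference, the G1 guideline in item 3 above follows from a short calculation in the shufflenetv2 paper: for a $1\times 1$ convolution with $c_1$ input channels, $c_2$ output channels and an $h\times w$ feature map, the FLOPs are $B = hwc_1c_2$, while

$$\mathrm{MAC} = hw(c_1 + c_2) + c_1 c_2 \ge 2\sqrt{hwB} + \frac{B}{hw},$$

where the inequality follows from the AM-GM inequality and becomes an equality when $c_1 = c_2$: for a fixed FLOPs budget, equal input and output channel widths minimize MAC.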

Summary of Lightweight Network Model Deployment

Based on reading and understanding the classic lightweight network papers, the mobilenet series, MobileDets, the shufflenet series, cspnet, vovnet, repvgg and others, the following conclusions are drawn:

  1. Low-compute devices, i.e. mobile-phone CPU hardware: consider mobilenetv1 (depthwise-separable convolution architecture, low FLOPs) and shufflenetv2 (low FLOPs and low MAC; note that the channel_shuffle operator may not be supported by the inference framework).

  2. Dedicated ASIC hardware devices, i.e. NPU chips (Horizon x3/x4, HiSilicon 3519, Ambarella cv22, etc.): for object detection, consider the cspnet network (reduces repeated gradient information) and repvgg (plain direct-connection architecture, simple to deploy; high network parallelism helps exploit compute power, but there is a risk of accuracy drop after quantization).

  3. NVIDIA GPU hardware, e.g. the T4 chip: consider the repvgg network (a VGG-like convolutional architecture; high parallelism brings high speed, and the single-branch architecture saves video memory/memory).

The MobileNet block (depthwise separable convolution block) is relatively inefficient on accelerated hardware (purpose-built NPU chips).

This conclusion is mentioned in the CSPNet (https://arxiv.org/pdf/1911.11929.pdf) and MobileDets (https://arxiv.org/pdf/2004.14525.pdf) papers.

The exception is when the chip vendor has made custom optimizations to improve the computational efficiency of the depthwise separable convolution block, as Horizon Robotics has done for its x3 chip.

The table below shows performance test results for MobileNetv2 and ResNet50 on some common NPU chip platforms.

Performance test results of depthwise separable convolution and conventional convolution models on different NPU chip platforms

All of the above is experience with deploying lightweight models on different hardware platforms, summarized from reading the lightweight network papers; the actual results still need to be verified by running the tests by hand.

References

  • An Introduction to Modern GPU Architecture:

    https://download.nvidia.com/developer/cuda/seminar/TDCI_Arch.pdf

  • A collection of lightweight network paper analysis:

    https://github.com/HarleysZhang/cv_note/tree/79740428b6162630eb80ed3d39052cac52f60c32/7-model_compression/%E8%BD%BB%E9%87%8F%E7%BA%A7%E7%BD%91%E7%BB%9C%E8%AE%BA%E6%96%87%E8%A7%A3%E6%9E%90
