Foreword
The core of lightweight network design is to make the network smaller and faster while preserving accuracy as much as possible. For manually designing lightweight networks there are no universally adopted guidelines, only general design principles and design summaries for different chip platforms (different chip architectures). It is recommended to draw principles and suggestions from the classic papers, and then actually deploy and benchmark the model on each target hardware platform.
Definition and understanding of some keywords
Computation: FLOPs
- FLOPs: floating point operations, the number of floating-point operations. It is understood as the amount of computation and can be used to measure the time complexity of an algorithm/model.
- FLOPS (all uppercase): Floating-point Operations Per Second, the number of floating-point operations performed per second. It is understood as computation speed and is a metric for hardware performance/model speed, i.e., the computing power of a chip.
- MACCs: multiply-accumulate operations, the number of multiply-add operations. MACCs is roughly half of FLOPs: w[0]*x[0] + ... counts as one multiply-accumulate, i.e., 1 MACC.
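As a concrete illustration, here is a minimal sketch (my own example, assuming square kernels and ignoring bias terms) of counting MACCs and FLOPs for a standard convolution layer:

```python
def conv2d_maccs(c_in, c_out, k, h_out, w_out):
    """MACCs of a standard conv layer: each output element needs
    c_in * k * k multiply-accumulate operations."""
    return c_in * k * k * c_out * h_out * w_out

def conv2d_flops(c_in, c_out, k, h_out, w_out):
    # One MACC = one multiply + one add, hence FLOPs ~= 2 * MACCs.
    return 2 * conv2d_maccs(c_in, c_out, k, h_out, w_out)

# Example: 3x3 conv, 64 -> 128 channels, 56x56 output feature map.
print(conv2d_maccs(64, 128, 3, 56, 56))  # 231211008, i.e. ~0.23 GMACs
```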
Memory access cost MAC
MAC: Memory Access Cost. It refers to the total amount of memory traffic generated when the model (or a convolution layer) completes one forward pass for a single input sample (one image), i.e., the space complexity of the model, measured in Bytes.
For the calculation method of FLOPs and MAC, refer to the neural network model complexity analysis: https://github.com/HarleysZhang/cv_note/blob/79740428b6162630eb80ed3d39052cac52f60c32/9-model_deploy/B-%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C%E6%A8%A1%E5%9E%8B%E5%A4%8D%E6%9D%82%E5%BA%A6%E5%88%86%E6%9E%90.md.
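As a sketch (my own example, following the per-layer accounting convention used in the ShuffleNetV2 paper), the MAC of a convolution layer can be estimated as reading the input feature map, writing the output feature map, and reading the weights once:

```python
def conv2d_mac(c_in, c_out, k, h, w):
    """Approximate MAC (in elements) of a conv layer with h x w feature maps."""
    feature_maps = h * w * (c_in + c_out)  # read input + write output activations
    weights = k * k * c_in * c_out         # read kernel parameters once
    return feature_maps + weights          # multiply by dtype size to get Bytes

# Example: 3x3 conv, 64 -> 128 channels, 56x56 feature maps, FP32 (4 Bytes).
print(conv2d_mac(64, 128, 3, 56, 56) * 4)  # 2703360 Bytes, ~2.7 MB per pass
```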
GPU memory bandwidth
- A GPU's memory bandwidth determines how quickly it can move data from memory (vRAM) to the compute cores; it is a more representative metric than memory speed.
- The value of the memory bandwidth depends on the data transfer speed between the memory and the compute cores, and on the number of separate parallel links in the bus between these two parts.
The NVIDIA RTX A4000 is built on the NVIDIA Ampere architecture; its chip specifications are as follows:

NVIDIA RTX A4000 Chip Specifications

The A4000 chip is equipped with 16 GB of GDDR6 video memory and a 256-bit memory interface (the number of independent links on the bus between the GPU and VRAM); thanks to these memory-related characteristics, the A4000 achieves a memory bandwidth of 448 GB/s.
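As a sanity check (my own calculation, assuming the 14 Gbps effective per-pin data rate commonly quoted for this GDDR6 configuration), peak bandwidth is the per-pin rate times the bus width in bytes:

```python
data_rate_gbps = 14                 # assumed effective per-pin rate, Gbit/s
bus_width_bits = 256                # A4000 memory interface width
bandwidth_gb_s = data_rate_gbps * bus_width_bits / 8
print(bandwidth_gb_s)               # 448.0 GB/s, matching the A4000 spec
```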
Latency and Throughput
Refer to the NVIDIA presentation An Introduction to Modern GPU Architecture by Ashu Rege (Director of Developer Technology): https://download.nvidia.com/developer/cuda/seminar/TDCI_Arch.pdf.
A general explanation of latency (Latency) and throughput (Throughput) in the field of deep learning:

- Latency: both humans and machines need reaction time to make a decision or take an action. Latency is the time elapsed between making a request and receiving a response. In most human-facing software systems (not just AI systems), latency is measured in milliseconds.
- Throughput: how many inference results a created or deployed deep learning network can deliver. Simply understood, it is the maximum number of input samples the network can process within one unit of time (e.g., one second).
The CPU is a low-latency, low-throughput processor; the GPU is a high-latency, high-throughput processor.
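A toy calculation (my own illustrative numbers) shows how batching trades the two metrics off:

```python
batch_size = 32
batch_latency_s = 0.040                     # one batched forward pass: 40 ms
throughput = batch_size / batch_latency_s
print(f"{throughput:.0f} samples/s")        # 800 samples/s
# Each request still waits 40 ms: batching raises throughput, not latency.
```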
Volatile GPU Util
Many people use the nvidia-smi command and read the Volatile GPU-Util field to get the GPU utilization rate, but two misunderstandings easily arise about this utilization (GPU Util):
- Myth 1: GPU utilization = the proportion of compute units inside the GPU that are working, so the higher the utilization, the more fully the computing power must be exploited.
- Myth 2: under the same conditions, the higher the utilization, the shorter the runtime must be.
In fact, GPU Util essentially only reflects the percentage of time, within the sampling period, during which one or more kernels were executing on the GPU; the sampling period is between 1/6 s and 1 s.
The original text reads: "Percent of time over the past sample period during which one or more kernels was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product." Source: the nvidia-smi documentation (https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf).
In layman's terms, it is the ratio of kernel execution time to total time within a period. For example, if GPU Util is 69% and the sampling period is 1 s, then over the past 1 s the GPU spent 0.69 s running kernels. If GPU Util is 0%, the GPU is not being used and is idle.
That is to say, it does not tell us how many SMs are used for computation, how "busy" the program is, or what the memory usage is like; in short, it cannot reflect how fully the computing power is utilized.
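To see Myth 1 fail in practice, here is a hedged sketch (assuming PyTorch and a CUDA device are available): the loop below keeps some kernel executing almost all the time, so nvidia-smi reports high GPU Util even though the tiny tensor occupies only a small fraction of the SMs.

```python
import torch

# A tiny tensor: each kernel launch uses only a small fraction of the GPU's
# SMs, yet because some kernel is almost always executing, running
# `watch nvidia-smi` in another terminal shows Volatile GPU-Util near 100%.
x = torch.randn(16, 16, device="cuda")
for _ in range(10_000_000):
    x = torch.sin(x)
torch.cuda.synchronize()
```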
For the essence of GPU Util, refer to the Zhihu article "Teach you how to keep squeezing GPU computing power" (https://zhuanlan.zhihu.com/p/346389176) and this Stack Overflow Q&A (https://stackoverflow.com/questions/40937894/nvidia-smi-volatile-gpu-utilization-explanation).
NVIDIA GPU Architecture
Compared with the CPU, the GPU devotes more transistors to data processing rather than to data caching and flow control, so GPUs are well suited to highly parallel computations. At the same time, the GPU provides higher instruction throughput and memory bandwidth than the CPU.
An intuitive comparison of the CPU and the GPU is shown below.

Distribution of chip resources for a CPU versus a GPU

Image source: CUDA C++ Programming Guide (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)
Finally, a brief summary of some characteristics of the NVIDIA GPU architecture:

- SIMT (Single Instruction, Multiple Threads) mode: multiple Cores can only execute the same instruction at the same time. Although this looks somewhat similar to SIMD on modern CPUs, it is actually fundamentally different.
- Because it has little Cache and Control logic, the GPU is better suited to compute-intensive and data-parallel programs.
The evolution of the NVIDIA GPU architecture from 2008 to 2020 is shown in the following figure:

2008-2020 NVIDIA GPU Architecture Evolution History
In addition, for a historical overview of the Nvidia GPU architecture's evolution from 2010 to 2020, refer to the Zhihu article "Nvidia GPU architecture has evolved for nearly ten years, from Fermi to Ampere" (https://zhuanlan.zhihu.com/p/413145211).
For an in-depth understanding of GPU architecture, refer to the cnblogs article "In-depth GPU hardware architecture and operating mechanism" (https://www.cnblogs.com/timlly/p/11471507.html#41-gpu%E6%B8%B2%E6%9F%93%E6%80%BB%E8%A7%88).
Understanding of CNN Architecture
To a certain extent, the deeper and wider the network, the better its performance. Width refers to the number of channels (channel) and depth to the number of layers (layer); resnet18, for example, is an 18-layer network. Note that the width discussed here has nothing to do with models such as broad learning; it refers specifically to the (channel) width of deep convolutional neural networks.
- The significance of network depth: the layers of a CNN abstract the input image data level by level; for example, the first layer learns edge features, the second layer learns simple shape features, and the third layer learns target shape features. Increasing the network depth improves the model's abstraction ability.
- The significance of network width: the width (number of channels) of a layer is the number of its (3-dimensional) filters. The more filters, the stronger the ability to extract target features, meaning each layer can learn richer features, such as texture features of different directions and frequencies.
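As an illustrative sketch (my own example, assuming PyTorch), width and depth map directly onto the channel count per layer and the number of stacked convolution layers:

```python
import torch.nn as nn

def make_cnn(width: int, depth: int) -> nn.Sequential:
    """Toy CNN: `width` = channels per layer, `depth` = number of conv layers."""
    layers, c_in = [], 3                     # RGB input
    for _ in range(depth):
        layers += [nn.Conv2d(c_in, width, kernel_size=3, padding=1), nn.ReLU()]
        c_in = width
    return nn.Sequential(*layers)

wide_net = make_cnn(width=256, depth=4)   # wider: more filters per layer
deep_net = make_cnn(width=64, depth=16)   # deeper: more levels of abstraction
```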
Manual Design of Efficient CNN Architecture Proposals
Some conclusions
- The inference performance of a model must be analyzed in combination with the specific inference platform (common ones: Nvidia GPU, mobile ARM CPU, edge-side NPU chip, etc.). Currently known factors that affect CNN model inference performance include: operator computation FLOPs (parameter count Params), the memory access cost of convolution blocks (memory access bandwidth), the degree of network parallelism, etc. On the same hardware platform and with the same network architecture, however, the FLOPs speedup ratio is directly proportional to the inference-time speedup ratio.
- For lightweight network design it is suggested to consider direct metrics (e.g., speed) rather than indirect metrics (e.g., FLOPs).
- Low FLOPs does not equal low latency; this is especially untrue on hardware with acceleration capabilities (GPU, DSP, and TPU), and must be analyzed in detail against the specific hardware architecture.
- Even if CNN models with different network architectures have the same FLOPs, their MAC may differ greatly.
- Most of the time, for GPU chips, the Depthwise convolution operator is a low-FLOPs operator with a large amount of data reading and writing. Since the bottleneck of a GPU chip's computing power usually lies in the memory access bandwidth, such read/write-heavy operations make the model waste a lot of time reading and writing data to and from video memory, so the GPU's computing power is not fully utilized (see the arithmetic-intensity sketch after this list). Source: the Zhihu article "FLOPs and model inference speed" (https://zhuanlan.zhihu.com/p/122943688).
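To make the depthwise point concrete, here is a rough arithmetic-intensity comparison (my own sketch, reusing the FLOPs and MAC conventions defined earlier): the depthwise operator performs far fewer FLOPs per byte moved, which is exactly the profile that a memory-bandwidth bottleneck punishes.

```python
def arithmetic_intensity(flops, mac_elems, bytes_per_elem=4):
    """FLOPs per byte moved: low values tend to be memory-bandwidth-bound."""
    return flops / (mac_elems * bytes_per_elem)

h = w = 56; c = 128; k = 3
# Standard 3x3 convolution, c -> c channels.
std_flops = 2 * c * k * k * c * h * w
std_mac = h * w * (c + c) + k * k * c * c
# Depthwise 3x3 convolution: one k x k filter per channel.
dw_flops = 2 * k * k * c * h * w
dw_mac = h * w * (c + c) + k * k * c

print(arithmetic_intensity(std_flops, std_mac))  # ~243 FLOPs/Byte
print(arithmetic_intensity(dw_flops, dw_mac))    # ~2.2 FLOPs/Byte
```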
Some advice
- On most hardware, channel counts that are multiples of 16 are advantageous for efficient computation. For example, on HiSilicon 351x series chips, when the number of input channels is a multiple of 4 and the number of output channels is a multiple of 16, the time speedup ratio is approximately equal to the FLOPs speedup ratio, which helps improve the compute utilization of the NNIE hardware.
- For low channel counts (such as the first few layers of the network), it is usually more efficient to use normal convolution than separable convolution on accelerated hardware (NPU chips). (Source: MobileDets paper (https://medium.com/ai-blog-tw/mobiledets-flops%E4%B8%8D%E7%AD%89%E6%96%BClatency-%E8%80%83%E9%87%8F%E4%B8%8D%E5%90%8C%E7%A1%AC%E9%AB%94%E7%9A%84%E9%AB%98%E6%95%88%E6%9E%B6%E6%A7%8B-5bfc27d4c2c8))
- The shufflenetv2 paper (https://arxiv.org/pdf/1807.11164.pdf) proposes four practical guidelines for efficient network design: G1, equal input and output channel widths minimize MAC (see the sketch after this list); G2, excessive group convolution increases MAC; G3, network fragmentation reduces parallelism; G4, element-wise operations cannot be ignored.
- On GPU chips, 3×3 convolution is very fast: its computational density (theoretical operations divided by the time used) can reach four times that of 1×1 and 5×5 convolutions. (Source: RepVGG paper (https://zhuanlan.zhihu.com/p/344324470))
- Improve model inference efficiency by addressing the problem of redundant gradient information, as in the CSPNet (https://arxiv.org/pdf/1911.11929.pdf) network.
- Address the high memory access cost and energy consumption caused by DenseNet's dense connections, as in the VoVNet (https://arxiv.org/pdf/1904.09730.pdf) network, which is built from OSA (One-Shot Aggregation) modules.
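As a quick numerical check of guideline G1 (my own sketch, reusing the MAC convention above): for a fixed FLOPs budget, a 1×1 convolution has the lowest MAC when its input and output channel counts are equal.

```python
def conv1x1_mac(c_in, c_out, h, w):
    # MAC (elements) of a 1x1 conv: activations in/out plus weights.
    return h * w * (c_in + c_out) + c_in * c_out

h = w = 56
budget = 128 * 128                     # fix c_in * c_out, i.e. fix FLOPs
for c_in in (32, 64, 128, 256, 512):
    c_out = budget // c_in
    print(c_in, c_out, conv1x1_mac(c_in, c_out, h, w))
# MAC is minimized at c_in == c_out == 128, matching guideline G1.
```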
Summary of Lightweight Network Model Deployment
Based on reading and understanding the classic lightweight network papers, the mobilenet series, MobileDets, the shufflenet series, cspnet, vovnet, repvgg, and others, the following conclusions are drawn:
- Low-compute devices (mobile phone CPU hardware): consider mobilenetv1 (depthwise separable convolution architecture, low FLOPs) or shufflenetv2, which has both low FLOPs and low MAC (though the channel_shuffle operator may not be supported by the inference framework).
- Dedicated ASIC hardware devices, i.e., NPU chips (Horizon x3/x4, HiSilicon 3519, Ambarella cv22, etc.): for object detection problems consider the cspnet network (reduces repeated gradient information) or repvgg (plain, direct-connection architecture: simple to deploy, and its high network parallelism helps exploit the chip's computing power; there is a risk of accuracy drop after quantization).
- NVIDIA GPU hardware (T4 chip): consider the repvgg network (vgg-like convolutional architecture: high parallelism brings high speed, and the single-path architecture saves video memory/memory).
The MobileNet block (depthwise separable convolution block) is relatively inefficient on accelerated hardware (chips with dedicated hardware designs, i.e., NPUs). This conclusion is mentioned in the CSPNet (https://arxiv.org/pdf/1911.11929.pdf) and MobileDets (https://arxiv.org/pdf/2004.14525.pdf) papers.

The exception is when the chip manufacturer has made custom optimizations to improve the computational efficiency of the depthwise separable convolution block; for example, the Horizon Robotics x3 chip has custom-optimized this block.
The table below shows the performance test results of MobileNetv2 and ResNet50 on some common NPU chip platforms.

Performance test results of depthwise separable convolution and conventional convolution models on different NPU chip platforms

All of the above is experience with deploying lightweight models on different hardware platforms, summarized from reading the lightweight network papers; the actual results still need to be verified by running the tests manually.
References
- An Introduction to Modern GPU Architecture: https://download.nvidia.com/developer/cuda/seminar/TDCI_Arch.pdf
- A collection of lightweight network paper analyses: https://github.com/HarleysZhang/cv_note/tree/79740428b6162630eb80ed3d39052cac52f60c32/7-model_compression/%E8%BD%BB%E9%87%8F%E7%BA%A7%E7%BD%91%E7%BB%9C%E8%AE%BA%E6%96%87%E8%A7%A3%E6%9E%90