IMG Series4 NNA for Efficient Inference

Research on neural network architectures generally prioritizes accuracy over efficiency. Efficiency has been studied in some work (Tan and Le 2020; Sandler et al. 2018), but usually for CPU- or GPU-based inference rather than accelerator-based inference.

In original work from the Imagination AI research team, we evaluated many well-known classification networks trained on ImageNet. We are not interested in accuracy or cost per se, but in efficiency, which combines the two: we want a neural network that achieves high accuracy on our IMG Series4 NNAs at the lowest possible cost.

We cover:

● Identifying the ImageNet classification network architectures that provide the best accuracy/performance balance on our Series4 NNAs

● Using quantization-aware training (QAT) and low-precision weights to drastically reduce cost without compromising accuracy


Network Architectures for Efficient Inference

We explored the trade-off space of different ImageNet classification networks, measuring several cost dimensions:

○ Inference time (seconds per inference)

○ Bandwidth (data transferred per inference)

○ Energy consumption (joules per inference)

In each case, we are interested in the networks that achieve the highest accuracy at the lowest cost. The accuracy and cost of these networks are shown below, averaged over various single-core configurations of our Series4 NNA with varying on-chip memory (OCM) sizes and external bandwidths. All networks are quantized to 8 bits without quantization-aware training (QAT).


Figure 1: Trade-off curves for recommending networks. We illustrate the accuracy/performance trade-offs for a range of backbone networks mapped and validated on IMG Series4, and define a trade-off line to help developers select a backbone network optimally. The shaded red region marks poor trade-offs and fades as it approaches the trade-off line; the least efficient networks appear towards the lower right, while the most efficient networks lie on the line itself.

The figure shows the trade-off space between top-1 accuracy and, in turn, inference time, bandwidth, and energy per inference. Networks in the upper left of each plot are more efficient than those in the lower right, and those closest to the upper left perform most efficiently on the Series4 NNA. We fit a "trade-off line" to these points; suboptimal networks sit below it.

The networks on the trade-off line outperform the other networks tested on our hardware and span a good range of operating points in the efficiency trade-off space, from fast/lower-accuracy (MobileNet v2) to slow/higher-accuracy (EfficientNet Lite v4).
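To make the notion of a trade-off line concrete, the sketch below shows how one might pick out the Pareto-optimal networks from a table of (cost, accuracy) measurements; the networks that survive this filter are the ones that would lie on the line. The numbers and selection logic are illustrative placeholders, not our measured Series4 results or our exact fitting procedure.

```python
# Minimal sketch: select the Pareto-optimal (cost, accuracy) points, i.e. the
# networks that would define a trade-off line. Values are placeholders, not
# measured Series4 results.
measurements = {
    # network: (cost per inference, arbitrary units; top-1 accuracy, %)
    "mobilenet_v2":       (1.0, 71.9),
    "efficientnet_lite4": (4.0, 80.4),
    "vgg16":              (6.0, 71.6),
    "resnet50":           (3.0, 76.1),
}

def pareto_front(points):
    """Return the networks not dominated by any other network
    (another network dominates if it is no more costly and no less accurate)."""
    front = []
    for name, (cost, acc) in points.items():
        dominated = any(
            other_cost <= cost and other_acc >= acc
            and (other_cost, other_acc) != (cost, acc)
            for other_name, (other_cost, other_acc) in points.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    # Order from fast/lower-accuracy to slow/higher-accuracy.
    return sorted(front, key=lambda n: points[n][0])

print(pareto_front(measurements))
# ['mobilenet_v2', 'resnet50', 'efficientnet_lite4'] -- vgg16 is dominated
```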

When deploying on Series4, network architects may consider whether one of these networks can serve as a foundation (for example, as a backbone for object detection). VGG-16 (Simonyan and Zisserman, 2015), for instance, is a popular backbone network, but it sits below the trade-off line in the figure, indicating that for the accuracy it achieves it is demanding in terms of inference time, bandwidth, and power consumption. We can instead choose EfficientNet Lite v3, which achieves the same accuracy with higher efficiency.

We also noticed that there are diminishing returns when trying to maximize accuracy. For example, the very large RegNet-X-32GF (Radosavovic et al. 2020) achieves the highest accuracy on Series4 among all the networks we analyze, but at a high cost.


Low bit depth and efficient inference

We also investigated ways to improve network efficiency on our Series4 NNAs. A powerful way to achieve this is to use quantization-aware training (QAT) to reduce the bit depth of the network, enabling more efficient inference. Imagination's neural network accelerator hardware supports weights in low-precision formats, which we exploit in this work. This has multiple advantages, including fewer hardware channels, better memory utilization, and lower bandwidth consumption. We map QAT-trained neural networks with 8-bit, 6-bit, and 4-bit weights onto our Series4 NNA and measure the change in their performance, using 8-bit data throughout.
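As a concrete, generic illustration of what low-precision weights mean in practice, the sketch below fake-quantizes a weight tensor to 8, 6 and 4 bits using a simple symmetric, per-tensor scheme. This is only an assumption-laden example; it is not the exact weight format used by Series4 or by our QAT framework.

```python
import torch

def fake_quantize_weights(w: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Quantize-dequantize a weight tensor to n_bits with a symmetric,
    per-tensor scheme. Generic illustration only; the Series4 weight format
    and our in-house QAT framework may differ."""
    qmax = 2 ** (n_bits - 1) - 1       # e.g. 127 for 8-bit, 7 for 4-bit
    scale = w.abs().max() / qmax       # map the largest weight magnitude to qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q * scale                   # dequantized values seen by the model

w = torch.randn(64, 3, 3, 3)           # e.g. one conv layer's weights
for bits in (8, 6, 4):
    w_q = fake_quantize_weights(w, bits)
    err = (w - w_q).abs().mean().item()
    print(f"{bits}-bit weights: mean abs quantization error = {err:.5f}")
```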

The results shown below are for networks running on single-core and eight-core Series4 NNAs. The single-core configuration has 1 MB of on-chip memory (OCM) and 12 GB/s of external bandwidth; the eight-core configuration has 88 MB of OCM and 115 GB/s of external bandwidth. Lower weight bit depths significantly improve inference efficiency, especially where bandwidth and memory are dominated by the network weights (e.g. VGG-16). In general, the efficiency gains are most pronounced when bandwidth and memory are limited.
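As an illustrative back-of-envelope calculation (not a measured result), the snippet below estimates the weight traffic per inference for a VGG-16-sized model of roughly 138 million parameters at different weight bit depths, and how long that traffic alone would occupy the single-core configuration's 12 GB/s external interface, assuming the weights are streamed from external memory on every inference.

```python
# Back-of-envelope estimate only: weight traffic per inference for a
# VGG-16-sized model (~138M parameters) at different weight bit depths,
# assuming all weights are streamed over a 12 GB/s external interface.
params = 138_000_000        # approximate VGG-16 parameter count
bandwidth_gbps = 12         # single-core external bandwidth (GB/s)

for bits in (32, 8, 6, 4):
    weight_mb = params * bits / 8 / 1e6                       # MB of weight data
    transfer_ms = weight_mb / (bandwidth_gbps * 1e3) * 1e3    # milliseconds
    print(f"{bits:>2}-bit weights: {weight_mb:7.1f} MB, "
          f"~{transfer_ms:5.1f} ms just to stream weights")
```

With only 1 MB of OCM, almost none of these weights can stay on chip, which is why weight-heavy networks such as VGG-16 benefit the most from lower weight bit depths.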

Improvement in time per inference

● Single core, low bandwidth, low memory


Figure 2: Relative improvement in time per inference performed on a single-core IMG Series4

We observe significant improvements for architectures with a high proportion of weights, namely VGG-16, HarDNet85 (Chao et al., 2019) and ResNet50 (He et al., 2015). On the other hand, RegNets, with their more compact grouped convolutions, gain less from quantization-aware training (QAT) when run on high-bandwidth hardware, but we still achieve a 25-50% boost in inferences per second on low-bandwidth hardware.

● Octa-core, high bandwidth, large memory


Figure 3: Relative improvement in time per inference performed on 8-core Series4

Execution on the eight-core Series4 is less bandwidth-constrained and more compute-constrained, resulting in a smaller improvement in time per inference as we compress the network weights to 6 or 4 bits.

Improvement in memory transfers per inference

● Single core, low bandwidth, low memory


Figure 4: Relative improvement in total bandwidth per inference performed on a single-core Series4.

Figure 4 shows that the total bandwidth of RegNet X 8GF and VGG-16 is reduced by more than 2.5×.

● Octa-core, high bandwidth, large memory


Figure 5: Relative improvement in total bandwidth per inference performed on 8-core Series4.

Improvement in power consumption per inference

● Single core, low bandwidth, low memory


Figure 6: Relative improvement in power consumption per inference executed on a single-core Series4

● Octa-core, high bandwidth, large memory


Figure 7: Relative improvement in power consumption per inference executed on an 8-core Series4


Accuracy

In most cases, networks with 8-bit and 6-bit weights achieve accuracy similar to the original 32-bit floating-point networks, while networks with 4-bit weights show a more pronounced drop in accuracy. In these experiments, the same weight bit depth is used for every layer of the network; recent Imagination research has shown that letting the network learn the bit depth of its weights can reduce the overall network size without compromising quality, and we expect such adaptive bit depths to lead to higher accuracy and better performance here too (Csefalvay 2022). Furthermore, we observe that networks with a large number of weights and no grouped convolutions are more robust to compression, showing a smaller decrease in accuracy.


Figure 8: Accuracy of 32-bit, 8-bit, 6-bit and 4-bit backbones evaluated on Series4. ResNet50 and VGG-16 are the most robust networks evaluated in our study.


Quantization-Aware Training (QAT) Tool

To perform the above analysis, we developed an in-house quantization-aware training (QAT) framework that quantizes PyTorch models and maps them onto the Imagination NNA. Versatility and simplicity are the notable strengths of this tool. For QAT, we used very similar training hyper-parameters for all the networks in this blog post; with careful fine-tuning we expect to achieve better performance and accuracy, especially at low bit depths, and this will be future work.
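For readers unfamiliar with QAT, the sketch below illustrates the general pattern such a framework follows: weights are fake-quantized on the forward pass and a straight-through estimator lets gradients bypass the non-differentiable rounding step, so a pre-trained model can be fine-tuned at low bit depth. This is a minimal, generic PyTorch illustration under those assumptions, not our in-house tool or its API.

```python
import torch
import torch.nn as nn

class FakeQuantWeights(nn.Module):
    """Wrap a conv/linear layer and fake-quantize its weights to n bits on the
    forward pass, using a straight-through estimator for the backward pass.
    Generic QAT illustration, not Imagination's in-house framework."""
    def __init__(self, layer: nn.Module, n_bits: int = 4):
        super().__init__()
        self.layer = layer
        self.qmax = 2 ** (n_bits - 1) - 1

    def forward(self, x):
        w = self.layer.weight
        scale = w.detach().abs().max() / self.qmax
        w_q = torch.clamp(torch.round(w / scale), -self.qmax, self.qmax) * scale
        # Straight-through estimator: forward uses w_q, backward sees identity.
        w_ste = w + (w_q - w).detach()
        if isinstance(self.layer, nn.Conv2d):
            return nn.functional.conv2d(
                x, w_ste, self.layer.bias, self.layer.stride,
                self.layer.padding, self.layer.dilation, self.layer.groups)
        return nn.functional.linear(x, w_ste, self.layer.bias)

# Tiny usage example: fine-tune a single 4-bit conv layer on random data.
layer = FakeQuantWeights(nn.Conv2d(3, 8, 3, padding=1), n_bits=4)
opt = torch.optim.SGD(layer.parameters(), lr=1e-3)
x, target = torch.randn(2, 3, 32, 32), torch.randn(2, 8, 32, 32)
for _ in range(10):
    opt.zero_grad()
    loss = nn.functional.mse_loss(layer(x), target)
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.4f}")
```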


Summary

It is important to choose a neural network architecture that maximizes performance for a given accuracy. In addition, techniques such as quantization-aware training (QAT) for low-precision inference can be used to further reduce costs without significantly affecting accuracy. This is especially important where the target hardware supports low bit-depth inference, as is the case with the Imagination Series4 NNA.

In this blog post, we have identified several networks that achieve a good trade-off between cost and accuracy at different performance points. Further optimizations, such as taking advantage of hardware support for low bit-depth integer weights, can greatly improve performance without compromising accuracy.

References

Chao, Ping, Chao-Yang Kao, Yu-Shan Ruan, Chien-Hsiang Huang, and Youn-Long Lin. 2019. “HarDNet: A Low Memory Traffic Network.” Computer Vision and Pattern Recognition.

Csefalvay, Szabolcs. 2022. “Self-Compressing Neural Networks.” https://blog.imaginationtech.com/self-compressing-neural-networks.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. “Deep Residual Learning for Image Recognition.” Computer Vision and Pattern Recognition.

Radosavovic, Ilija, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. 2020. “Designing Network Design Spaces.” Computer Vision and Pattern Recognition.

Sandler, Mark, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. “MobileNetV2: Inverted Residuals and Linear Bottlenecks.” The IEEE Conference on Computer Vision and Pattern Recognition.

Simonyan, Karen, and Andrew Zisserman. 2015. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” Computer Vision and Pattern Recognition.

Tan, Mingxing, and Quoc V. Le. 2020. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.” International Conference on Machine Learning.

Original link: https://blog.imaginationtech.com/efficient-inference-on-img-series4-nnas
