A difference between GPU inference and on-device NPU inference

On-device AI inference is mainly done on an NPU. To strike a balance between performance, power consumption, die area, and versatility, mainstream NPUs adopt an accelerator architecture: common operators are hardened into fixed-function hardware, while programmable units handle custom and long-tail operators to retain flexibility. On the compute side, to improve memory efficiency and speed up computation while still meeting accuracy requirements, NPUs generally implement the core operators with fixed-point compute units, which achieve the required inference accuracy with lower bandwidth and faster computation. As a result, the data must be quantized and dequantized in the pre-processing and post-processing stages so that it matches the fixed-point format expected by the NPU's compute units. The working model of the NPU is shown in the following figure:
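As a rough illustration of the pre- and post-processing described above, here is a minimal sketch of symmetric int8 quantization and dequantization in NumPy. The function names and the per-tensor symmetric scheme are my own assumptions for illustration; real NPU toolchains use their own calibration and quantization formats.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization: float32 -> int8 plus a scale (illustrative only)."""
    scale = np.max(np.abs(x)) / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 values back to approximate float32 values."""
    return q.astype(np.float32) * scale

# Pre-processing: quantize the float input before handing it to the fixed-point unit
x = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(x)

# Post-processing: dequantize the integer result back to float
x_restored = dequantize_int8(q, s)
print("max quantization error:", np.max(np.abs(x - x_restored)))
```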

The GPU is different. Its compute units natively support floating-point arithmetic, so no quantization or dequantization is needed and model inference is more direct. Taking my own graphics card as an example, the figure below shows that its floating-point throughput is much higher than its fixed-point throughput:

Running inference on the GPU therefore requires no quantization or dequantization steps:
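A minimal sketch of this direct floating-point path, assuming PyTorch and a CUDA-capable device are available (the toy model here is my own placeholder, not the network used in the original post):

```python
import torch
import torch.nn as nn

# A toy FP32 model; a real network would be loaded from a checkpoint.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

x = torch.randn(1, 8, device=device)   # float32 input, no quantization step
with torch.no_grad():
    y = model(x)                        # forward pass runs directly in floating point
print(y)
```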

This difference in quantization requirements during inference can produce an interesting result: the accuracy of the same model may differ between the two platforms. The accuracy table mentioned here
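To make the accuracy difference concrete, here is a small sketch that compares a direct float32 matrix multiply (GPU-style) against a quantize-multiply-dequantize path (NPU-style). The quantization scheme and shapes are assumptions chosen for illustration; the printed difference is the kind of numerical gap that can show up as an accuracy change after many layers.

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor int8 quantization (illustrative assumption)
    scale = np.max(np.abs(x)) / 127.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 64)).astype(np.float32)   # activation
w = rng.standard_normal((64, 10)).astype(np.float32)  # weight

# Floating-point path (GPU-style): compute directly in float32
y_fp = x @ w

# Fixed-point path (NPU-style): quantize, multiply in int32, then dequantize
xq, sx = quantize_int8(x)
wq, sw = quantize_int8(w)
y_q = (xq.astype(np.int32) @ wq.astype(np.int32)).astype(np.float32) * sx * sw

print("max abs difference between the two paths:", np.max(np.abs(y_fp - y_q)))
```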


Source: blog.csdn.net/tugouxp/article/details/131019847