Introduction to the Roof-line Performance Analysis Model

REF

Performance Analysis of Roofline Model and Deep Learning Model - Programmer Sought

Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures https://people.eecs.berkeley.edu/~kubitron/cs252/handouts/papers/RooflineVyNoYellow.pdf 

 

Introduction to the Roof-line Model

The Roof-line Model describes the maximum computing performance (in FLOPS) that a program can theoretically attain on a given piece of hardware. The horizontal axis of the plot is the computational intensity, that is, the amount of computation divided by the amount of memory access (FLOPs per byte), and the vertical axis is the maximum attainable computing performance.

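Let pi be the peak computing performance of the hardware and beta its peak memory bandwidth; the ridge point of the roof-line is then Imax = pi / beta. When the actual computational intensity (amount of computation / amount of memory access) is greater than Imax, the program can theoretically reach the peak performance pi, and it is compute-bound. In that case the computation time is the amount of computation divided by pi.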

When the actual computational intensity I is less than Imax, the maximum attainable performance cannot reach pi. It is limited to I * beta, that is, (amount of computation / amount of memory access) * beta, where beta is the memory bandwidth. In other words, when a program performs little computation per byte of memory traffic, it is bandwidth-bound and it is difficult to reach the peak performance of the hardware.
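The two cases above can be combined into a single bound, min(pi, I * beta). The following is a minimal sketch of that bound in Python. The function names and the hardware numbers (10 TFLOP/s peak, 500 GB/s bandwidth) are assumptions chosen for illustration and do not describe any specific device.

```python
def ridge_point(peak_flops, peak_bandwidth):
    """Imax = pi / beta, in FLOPs per byte."""
    return peak_flops / peak_bandwidth


def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roof-line bound: min(pi, I * beta)."""
    return min(peak_flops, intensity * peak_bandwidth)


def min_runtime(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Lower bound on runtime: the slower of compute time and memory time.

    Equivalent to flops / attainable_flops(flops / bytes_moved, ...).
    """
    return max(flops / peak_flops, bytes_moved / peak_bandwidth)


# Hypothetical hardware, for illustration only.
PI = 10e12     # peak compute performance, FLOP/s
BETA = 500e9   # peak memory bandwidth, byte/s

if __name__ == "__main__":
    print("Imax =", ridge_point(PI, BETA), "FLOP/byte")
    for i in (1, 5, 20, 40):
        print(f"I = {i:>2} FLOP/byte -> bound = {attainable_flops(i, PI, BETA):.3e} FLOP/s")
```

With these assumed numbers, any kernel with an intensity below 20 FLOPs per byte sits on the sloped (bandwidth-bound) part of the roof-line, and any kernel above it sits on the flat (compute-bound) part.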

Therefore, to maximize the performance actually attainable, we need to increase the computational intensity of the computation. For example, operator fusion keeps the amount of computation the same but avoids writing intermediate results to memory and reading them back, which reduces the amount of memory access and raises the computational intensity. Matrix multiplication can likewise reduce its memory access and raise its computational intensity through well-chosen blocking (tiling) and similar techniques.
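To make the effect of operator fusion concrete, here is a rough sketch that counts FLOPs and bytes for a chain of element-wise operations. It assumes float32 data, one FLOP per element per operation, and that every unfused operation reads one array from memory and writes one array back. The function names and the cost model are illustrative assumptions, not measurements.

```python
BYTES_PER_ELEM = 4  # assumption: float32


def unfused_cost(n, num_ops):
    """Each op runs as its own kernel: read n elements, write n elements."""
    flops = num_ops * n
    bytes_moved = num_ops * 2 * n * BYTES_PER_ELEM
    return flops, bytes_moved


def fused_cost(n, num_ops):
    """All ops run in one kernel: read the input once, write the output once."""
    flops = num_ops * n
    bytes_moved = 2 * n * BYTES_PER_ELEM
    return flops, bytes_moved


def intensity(flops, bytes_moved):
    """Computational intensity in FLOPs per byte."""
    return flops / bytes_moved


if __name__ == "__main__":
    n = 1 << 20
    # e.g. a scale, a shift and a ReLU applied one after another
    print("unfused:", intensity(*unfused_cost(n, 3)), "FLOP/byte")
    print("fused:  ", intensity(*fused_cost(n, 3)), "FLOP/byte")
```

Under this simplified model the amount of computation is unchanged, while fusing three operations cuts memory traffic to one third and triples the computational intensity.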

Note the difference between the theoretical memory access amount of a computation and the memory access amount of an actual implementation. In matrix multiplication, for example, the two can be completely different, whereas for element-wise operations they are usually the same. In addition, the theoretical computing power of the hardware is well defined, but the bandwidth is not: because the storage system is hierarchical, the bandwidth actually attainable also depends on the computation itself, and an algorithm that exploits the temporal and spatial locality of its data can obtain a higher effective bandwidth.
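The gap between theoretical and actual memory access in matrix multiplication can be estimated with a simple counting model. The sketch below assumes C = A x B with A of size m x k and B of size k x n, float32 elements, and three simplified cases: an ideal case where every matrix is moved exactly once, a naive case with no data reuse, and a blocked case where each tile x tile block of C loads its rows of A and columns of B once. The function names and the model are assumptions for illustration; real kernels also depend on cache and register behaviour.

```python
def matmul_flops(m, n, k):
    """One multiply and one add per (i, j, l) triple."""
    return 2 * m * n * k


def ideal_bytes(m, n, k, elem=4):
    """Theoretical minimum: read A and B once, write C once."""
    return (m * k + k * n + m * n) * elem


def naive_bytes(m, n, k, elem=4):
    """No reuse: every output element re-reads a row of A and a column of B."""
    return (2 * m * n * k + m * n) * elem


def tiled_bytes(m, n, k, tile, elem=4):
    """Each tile x tile block of C loads tile*k elements of A and of B once."""
    assert m % tile == 0 and n % tile == 0, "sketch assumes tile divides m and n"
    num_tiles = (m // tile) * (n // tile)
    return (num_tiles * 2 * tile * k + m * n) * elem


if __name__ == "__main__":
    m = n = k = 1024
    flops = matmul_flops(m, n, k)
    print("ideal intensity:", flops / ideal_bytes(m, n, k))
    print("naive intensity:", flops / naive_bytes(m, n, k))
    print("tiled intensity:", flops / tiled_bytes(m, n, k, tile=32))
```

In this model the amount of computation is identical in all three cases; only the memory access amount, and therefore the computational intensity, changes with the blocking strategy.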

Roof-line analysis in practice, applied to matrix multiplication optimization:

[Under construction] CUDA GEMM theoretical performance analysis and kernel optimization - Zhihu

Origin blog.csdn.net/u013701860/article/details/124635521