[Architecture] GPU performance analysis using the Roofline model

Using the Roofline model to evaluate the performance of deep learning models running on GPUs

In this article, the Roofline model is used to assess the running performance of two deep learning models, AlexNet and VGGNet, on two GPUs: the GeForce RTX 2060 and the TITAN V.

  1. GPU performance parameters
    According to data from the official NVIDIA website, the GeForce RTX 2060 has a peak compute throughput of 7.5 TFLOPS and a memory bandwidth of 336 GB/s, while the TITAN V has a peak compute throughput of 7.0 TFLOPS and a memory bandwidth of 652.8 GB/s.
    Figure 1: GPU compute performance data (from the official NVIDIA website)
    Figure 2: GeForce RTX 2060 memory performance (from the official NVIDIA website)
    Figure 3: TITAN V memory performance (from the official NVIDIA website)
  2. The GPU Roofline model
    Figures 4 and 5 show the Roofline models of the GeForce RTX 2060 and the TITAN V.
    For the GeForce RTX 2060, when a model's arithmetic intensity is below 22.27 FLOP/byte, its performance on this GPU is limited by memory bandwidth; when the arithmetic intensity is above 22.27 FLOP/byte, its peak performance is limited by the GPU's floating-point throughput.
    For the TITAN V, when a model's arithmetic intensity is below 10.755 FLOP/byte, its performance on this GPU is limited by memory bandwidth; when the arithmetic intensity is above 10.755 FLOP/byte, its peak performance is limited by the GPU's floating-point throughput.
    Figure 4: Roofline model of the GeForce RTX 2060
    Figure 5: Roofline model of the TITAN V
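    The critical point (ridge point) of each Roofline is simply peak compute divided by memory bandwidth. A minimal sketch using the rounded specs quoted above; the small differences from the 22.27 and 10.755 FLOP/byte values in the text come from rounding the peak-TFLOPS figures:

```python
# Ridge point of a GPU's Roofline: the arithmetic intensity (FLOP/byte)
# at which a workload shifts from bandwidth-bound to compute-bound.

def ridge_point(peak_tflops: float, bandwidth_gbs: float) -> float:
    """Peak compute (TFLOPS) divided by memory bandwidth (GB/s)."""
    return peak_tflops * 1e12 / (bandwidth_gbs * 1e9)

rtx_2060 = ridge_point(7.5, 336.0)    # ~22.3 FLOP/byte
titan_v = ridge_point(7.0, 652.8)     # ~10.7 FLOP/byte
print(f"RTX 2060 ridge point: {rtx_2060:.2f} FLOP/byte")
print(f"TITAN V  ridge point: {titan_v:.2f} FLOP/byte")
```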
  3. Choice of deep learning models
    a) AlexNet
    AlexNet has 5 convolutional layers and 3 fully connected layers.
    Figure 6: AlexNet network architecture
    C1: 96×11×11×3 (number of kernels / width / height / depth), 34,848 parameters
    C2: 256×5×5×48 (number of kernels / width / height / depth), 307,200 parameters
    C3: 384×3×3×256 (number of kernels / width / height / depth), 884,736 parameters
    C4: 384×3×3×192 (number of kernels / width / height / depth), 663,552 parameters
    C5: 256×3×3×192 (number of kernels / width / height / depth), 442,368 parameters
    R1: 4096×6×6×256 (number of kernels / width / height / depth), 37,748,736 parameters
    R2: 4096×4096, 16,777,216 parameters
    R3: 4096×1000, 4,096,000 parameters
    About 60 million parameters in total.
    One training pass of AlexNet requires about 720 MFLOPs of floating-point computation in total. The memory traffic is 59.968M × 4 B = 239.872 MB, giving an arithmetic intensity of 720M / 239.872M ≈ 3 FLOP/byte.
    Figure 7: floating-point operations and parameter counts for each layer of the AlexNet model
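    The per-layer parameter counts above can be checked with a few lines of Python (a quick sketch; the shapes are taken directly from the list above, and biases are not counted). Summing them gives roughly 61M, consistent with the ~60 million figure quoted:

```python
# AlexNet parameter counts per layer: kernels × width × height × depth,
# using the shapes listed above (biases excluded).
layers = {
    "C1": 96 * 11 * 11 * 3,
    "C2": 256 * 5 * 5 * 48,
    "C3": 384 * 3 * 3 * 256,
    "C4": 384 * 3 * 3 * 192,
    "C5": 256 * 3 * 3 * 192,
    "R1": 4096 * 6 * 6 * 256,
    "R2": 4096 * 4096,
    "R3": 4096 * 1000,
}
for name, count in layers.items():
    print(f"{name}: {count:,} parameters")
total = sum(layers.values())
print(f"Total: {total:,} parameters")
```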
    b) VGG16
    Figure 8: VGG16 architecture
    Figure 9: parameter counts and computation volume of VGG16
    VGG16 requires about 15 GFLOPs of floating-point computation, with total memory traffic of about 600 MB, giving an arithmetic intensity of about 25 FLOP/byte.
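    The two arithmetic-intensity figures follow directly from FLOPs divided by memory traffic. A minimal sketch using the numbers above (4-byte single-precision accesses are assumed for AlexNet, as in the calculation above):

```python
def arithmetic_intensity(flops: float, bytes_accessed: float) -> float:
    """FLOPs per byte of memory traffic."""
    return flops / bytes_accessed

# AlexNet: ~720 MFLOPs, 59.968M values × 4 B of memory traffic
alexnet = arithmetic_intensity(720e6, 59.968e6 * 4)
# VGG16: ~15 GFLOPs, ~600 MB of memory traffic
vgg16 = arithmetic_intensity(15e9, 600e6)
print(f"AlexNet: {alexnet:.1f} FLOP/byte")  # ~3
print(f"VGG16:   {vgg16:.1f} FLOP/byte")    # 25
```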
  4. Model Performance Analysis
    a) Performance analysis running on the GeForce RTX 2060
    As the figure shows, the arithmetic intensity of VGG16 exceeds the critical point where the two lines meet, so its peak performance is mainly limited by the GPU's floating-point throughput, and it can theoretically reach a compute speed of 7.5 TFLOP/s. Since the arithmetic intensity of AlexNet is only 3 FLOP/byte, its performance bottleneck is memory bandwidth, and the theoretical maximum compute speed it can reach is 1.008 TFLOP/s.
    Figure 10: performance analysis of AlexNet and VGG16 running on the GeForce RTX 2060
    b) Performance analysis running on the TITAN V
    As the figure shows, the arithmetic intensity of VGG16 exceeds the critical point where the two lines meet, so its peak performance is mainly limited by the GPU's floating-point throughput, and it can theoretically reach a compute speed of 7 TFLOP/s. Since the arithmetic intensity of AlexNet is only 3 FLOP/byte, its performance bottleneck is memory bandwidth, and the theoretical maximum compute speed it can reach is 1.96 TFLOP/s.
    Figure 11: performance analysis of AlexNet and VGG16 running on the TITAN V
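    The attainable speeds quoted in both analyses follow from the standard Roofline bound, attainable = min(peak compute, arithmetic intensity × bandwidth). A minimal sketch reproducing the four numbers from the figures above:

```python
def attainable_tflops(intensity: float, peak_tflops: float, bw_gbs: float) -> float:
    """Roofline bound: min(peak compute, intensity × bandwidth), in TFLOP/s."""
    return min(peak_tflops, intensity * bw_gbs / 1000.0)

gpus = {"GeForce RTX 2060": (7.5, 336.0), "TITAN V": (7.0, 652.8)}
models = {"AlexNet": 3.0, "VGG16": 25.0}  # arithmetic intensity, FLOP/byte

for gpu, (peak, bw) in gpus.items():
    for model, intensity in models.items():
        t = attainable_tflops(intensity, peak, bw)
        print(f"{model} on {gpu}: {t:.3f} TFLOP/s")
```

AlexNet lands below each GPU's ridge point, so its bound is intensity × bandwidth (1.008 and ~1.96 TFLOP/s); VGG16 lands above both, so its bound is the peak (7.5 and 7.0 TFLOP/s).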
  5. Summary
    After the Roofline analysis of the two GPUs and the two deep learning models, we can conclude that, because the GeForce RTX 2060 has the higher peak compute speed, VGG16 achieves better performance on the GeForce RTX 2060, while the TITAN V has the faster memory access, so AlexNet performs better on the TITAN V.

Origin blog.csdn.net/yao09605/article/details/91572999