A "See-Through Lens" for Model Training: Break Through Training Bottlenecks with the PaddlePaddle Profiler

Deep learning developers know the feeling: in the later stages of "alchemy", the training performance bottleneck becomes a barrier that is hard to break. You have already optimized the network structure and tried every BuildStrategy (computation-graph optimization strategy), yet training speed still falls short of the target. You have used every performance trick in the book, so why do the results still miss expectations?



Where exactly is the problem?

At moments like this you wish you had a "see-through lens" that could open the black box of model execution and show which "part" has gone wrong.

If you are reading this article, you are in luck: the PaddlePaddle model performance analysis tool introduced here, Profiler, is exactly such a lens. It helps developers dissect the model, locate the performance bottleneck, and break through the training-speed ceiling.

How does the PaddlePaddle Profiler do it?

Profiler records OP execution times during model execution at multiple levels and along multiple dimensions, providing a direct source of performance data for further model optimization.


An example is worth a thousand words, so let PP, a seasoned "alchemist", describe two occasions on which Profiler was used to see through model performance and complete an optimization task.


Performance optimization with Profiler: two examples

  • Scenario 1: Mixed-precision performance optimization for XLNet
    PP was implementing mixed-precision training for XLNet and found that the speedup was below expectations. After trying various optimization strategies with no improvement, PP decided to profile the whole model with Profiler. Adding a single profiler.profiler interface in a with statement is enough to profile everything that runs inside the with block:

with profiler.profiler('All', 'total') as prof:
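For context, here is a minimal, self-contained sketch of how such a with-block typically wraps a training loop; the tiny fc network below is only a stand-in for the real XLNet program, and its names (x, y, loss) are placeholders rather than code from the original model:

import numpy as np
import paddle.fluid as fluid
import paddle.fluid.profiler as profiler

# A toy network standing in for the real model.
x = fluid.data(name='x', shape=[None, 16], dtype='float32')
y = fluid.layers.fc(input=x, size=4)
loss = fluid.layers.reduce_mean(y)

exe = fluid.Executor(fluid.CUDAPlace(0))
exe.run(fluid.default_startup_program())

# Profile CPU and GPU events for everything executed inside the with-block;
# the report, sorted by total time, is printed when the block exits.
with profiler.profiler('All', 'total') as prof:
    for step in range(10):
        exe.run(fluid.default_main_program(),
                feed={'x': np.random.rand(8, 16).astype('float32')},
                fetch_list=[loss])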

After reviewing the report carefully, PP noticed that the time share of the slice_grad OP was abnormal. Its main performance data are shown below:

------------------------->     Profiling Report     <-------------------------
Note! This Report merge all thread info into one.
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Total time: 35377.5
  Computation time       Total: 32290.3     Ratio: 91.2735%
  Framework overhead     Total: 3087.21     Ratio: 8.72648%
-------------------------     GpuMemCpy Summary     -------------------------
GpuMemcpy                Calls: 6352        Total: 627.243     Ratio: 1.773%
  GpuMemcpyAsync         Calls: 6312        Total: 621.182     Ratio: 1.75587%
  GpuMemcpySync          Calls: 40          Total: 6.06099     Ratio: 0.0171323%
-------------------------       Event Summary       -------------------------
Event                                                       Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
slice_grad                                                  1900        17120.6     16651.581006 (0.972604) 469.040045 (0.027396)   0.16375     92.6487     9.01085     0.48394
  slice_grad/compute                                        1900        17085.3     16616.287107 (0.972547) 469.040045 (0.027453)   0.150132    92.6314     8.99228     0.482943
  slice_grad/infer_shape                                    1900        7.84785     7.847853 (1.000000)     0.000000 (0.000000)     0.002344    0.131375    0.00413045  0.000221831
  slice_grad/prepare_data                                   1900        3.35192     3.351923 (1.000000)     0.000000 (0.000000)     0.00098     0.027065    0.00176417  9.47472e-05
matmul_grad                                                 3720        2403.56     591.410434 (0.246056)   1812.148788 (0.753944)  0.105223    12.9082     0.646118    0.0679402
  matmul_grad/compute                                       3720        2338.72     526.575728 (0.225155)   1812.148788 (0.774845)  0.092585    12.8887     0.628689    0.0661076
  matmul_grad/infer_shape                                   3720        15.3237     15.323685 (1.000000)    0.000000 (0.000000)     0.002286    0.034366    0.00411927  0.000433147
  matmul_grad/prepare_data                                  3720        5.76746     5.767456 (1.000000)     0.000000 (0.000000)     0.000957    0.033173    0.00155039  0.000163026
cast                                                        14600       1651.43     1240.865640 (0.751388)  410.564599 (0.248612)   0.02195     9.75191     0.113112    0.0466802
  cast/compute                                              14600       1499.45     1088.888215 (0.726190)  410.564599 (0.273810)   0.011647    9.74306     0.102702    0.0423843
  cast/infer_shape                                          14600       39.1496     39.149612 (1.000000)    0.000000 (0.000000)     0.001536    0.213652    0.00268148  0.00110662
  cast/prepare_data                                         14600       22.4475     22.447455 (1.000000)    0.000000 (0.000000)     0.000908    0.130246    0.0015375   0.00063451

Looking closely at the numbers, slice_grad accounts for roughly 48% of the total time (17120.6 ms out of 35377.5 ms, as the Ratio column shows). Yet the computation in this OP is not complex and should not take such a large share, so something is off.

-------------------------       Event Summary       -------------------------
Event                                                       Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
slice_grad                                                  1900        17120.6     16651.581006 (0.972604) 469.040045 (0.027396)   0.16375     92.6487     9.01085     0.48394
  slice_grad/compute                                        1900        17085.3     16616.287107 (0.972547) 469.040045 (0.027453)   0.150132    92.6314     8.99228     0.482943
  slice_grad/infer_shape                                    1900        7.84785     7.847853 (1.000000)     0.000000 (0.000000)     0.002344    0.131375    0.00413045  0.000221831
  slice_grad/prepare_data                                   1900        3.35192     3.351923 (1.000000)     0.000000 (0.000000)     0.00098     0.027065    0.00176417  9.47472e-05

Having found the abnormal time share of slice_grad, PP's first guess was that the OP's implementation had a problem. Digging further into its performance data, the CPU share (CPU Time (Ratio)) turned out to be very high, about 97%, suggesting that some part of the OP's CPU-side implementation was unreasonable and causing the slowdown. With the problem narrowed down, PP quickly located the offending code and optimized the CPU part of the slice_grad OP. After rerunning, overall training performance improved by 10%, a nice result.

  • Scenario 2: Control-flow performance optimization
    Now for a second example. PP had recently been studying PaddlePaddle control flow and, after reading the example on the official website, started looking at some control-flow performance numbers, only to find that the sample code ran slower on GPU than on CPU, which was puzzling. Wanting to speed up the GPU run, PP decided to investigate with Profiler, this time using the start_profiler and stop_profiler function interfaces to delimit the profiling range; narrowing the range makes it easier to pinpoint the problem.

import paddle.fluid as fluid
import paddle.fluid.layers as layers
import paddle.fluid.profiler as profiler
import time
# Start Profiler performance analysis
profiler.start_profiler('All')
def fn_1():
    return layers.fill_constant(shape=[1, 2], dtype='float32', value=1)

def fn_2():
    return layers.fill_constant(shape=[2, 2], dtype='int32', value=2)

def fn_3():
    return layers.fill_constant(shape=[3], dtype='int32', value=3)

main_program = fluid.default_main_program()
startup_program = fluid.default_startup_program()
with fluid.program_guard(main_program, startup_program):
    x = layers.fill_constant(shape=[1], dtype='float32', value=0.3)
    y = layers.fill_constant(shape=[1], dtype='float32', value=0.1)
    z = layers.fill_constant(shape=[1], dtype='float32', value=0.2)

    pred_1 = layers.less_than(z, x)  # true: 0.2 < 0.3
    pred_2 = layers.less_than(x, y)  # false: 0.3 < 0.1
    pred_3 = layers.equal(x, y)      # false: 0.3 == 0.1

    # Run fn_1 if pred_1 is true
    out_1 = layers.case(
        pred_fn_pairs=[(pred_1, fn_1), (pred_2, fn_2)], default=fn_3)

    # default is None here; if none of the predicates holds, the last branch is used as the default
    out_2 = layers.case(pred_fn_pairs=[(pred_2, fn_2), (pred_3, fn_3)])

    exe = fluid.Executor(fluid.CUDAPlace(0))
    for i in range(100):
        if i == 20:  # skip the first 20 iterations as warm-up
            time_start = time.time()
        res_1, res_2 = exe.run(main_program, fetch_list=[out_1, out_2])
    time_end = time.time()
    print('time cost',time_end-time_start,'s')
    # Stop Profiler performance analysis
    profiler.stop_profiler('total', '/tmp/profile')
    print(res_1)  # [[1. 1.]]
    print(res_2)  # [3 3 3]

The Profiler report looks like this:

------------------------->     Profiling Report     <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
-------------------------     Overhead Summary      -------------------------
Total time: 104.442
  Computation time       Total: 38.1497     Ratio: 36.5271%
  Framework overhead     Total: 66.2927     Ratio: 63.4729%
-------------------------     GpuMemCpy Summary     -------------------------
GpuMemcpy                Calls: 1200        Total: 32.5351     Ratio: 31.1512%
  GpuMemcpyAsync         Calls: 800         Total: 20.1238     Ratio: 19.2679%
  GpuMemcpySync          Calls: 400         Total: 12.4112     Ratio: 11.8833%
-------------------------       Event Summary       -------------------------
Event                                    Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::conditional_block               400         45.318      43.963987 (0.970122)    1.354010 (0.029878)     0.037256    0.637326    0.113295    0.433904
  GpuMemcpyAsync:GPU->CPU                400         11.8308     11.040198 (0.933176)    0.790586 (0.066824)     0.023449    0.059867    0.029577    0.113276
  assign                                 200         8.0128      7.738432 (0.965759)     0.274368 (0.034241)     0.033865    0.069719    0.040064    0.0767198
    GpuMemcpyAsync(same_gpu):GPU->GPU    200         4.55945     4.285085 (0.939824)     0.274368 (0.060176)     0.018909    0.048334    0.0227973   0.0436552
  fill_constant                          200         7.80007     7.511009 (0.962942)     0.289056 (0.037058)     0.034299    0.083957    0.0390003   0.0746829
thread0::fill_constant                   300         13.7209     13.298308 (0.969203)    0.422559 (0.030797)     0.025552    0.31214     0.0457362   0.131373
thread0::select_input                    200         13.4918     12.810170 (0.949480)    0.681598 (0.050520)     0.061153    0.095125    0.0674588   0.129179
  GpuMemcpySync:GPU->CPU                 200         6.69242     6.296225 (0.940800)     0.396191 (0.059200)     0.030713    0.052313    0.0334621   0.0640776
  GpuMemcpyAsync(same_gpu):GPU->GPU      200         3.73361     3.448200 (0.923557)     0.285407 (0.076443)     0.016164    0.047654    0.018668    0.035748
thread0::less_than                       200         7.32649     7.053690 (0.962765)     0.272799 (0.037235)     0.029206    0.07994     0.0366324   0.0701486
thread0::fetch                           200         7.16217     6.766394 (0.944741)     0.395773 (0.055259)     0.031348    0.06381     0.0358108   0.0685753
  GpuMemcpySync:GPU->CPU                 200         5.71881     5.323041 (0.930795)     0.395773 (0.069205)     0.025997    0.046626    0.0285941   0.0547557
thread0::logical_not                     200         7.04657     6.772137 (0.961055)     0.274432 (0.038945)     0.031275    0.054511    0.0352328   0.0674685
thread0::cast                            200         7.04549     6.768117 (0.960631)     0.277375 (0.039369)     0.030785    0.068275    0.0352275   0.0674582
thread0::equal                           100         3.33104     3.192579 (0.958432)     0.138464 (0.041568)     0.030176    0.053589    0.0333104   0.0318936


Looking at the performance data, PP found a large number of CPU<--->GPU data copies inside the control-flow OPs (1200 GpuMemcpy calls):

-------------------------     GpuMemCpy Summary     -------------------------
GpuMemcpy                Calls: 1200        Total: 32.5351     Ratio: 31.1512%
  GpuMemcpyAsync         Calls: 800         Total: 20.1238     Ratio: 19.2679%
  GpuMemcpySync          Calls: 400         Total: 12.4112     Ratio: 11.8833%

Since control-flow computation is lightweight and can perfectly well run on the CPU, most of these copies can be optimized away. So PP used device_guard to pin the control-flow-related OPs to the CPU and cut down the data copies between CPU and GPU. Time to modify the code:

import paddle.fluid as fluid
import paddle.fluid.layers as layers
import paddle.fluid.profiler as profiler
import time

profiler.start_profiler('All')
def fn_1():
    return layers.fill_constant(shape=[1, 2], dtype='float32', value=1)

def fn_2():
    return layers.fill_constant(shape=[2, 2], dtype='int32', value=2)

def fn_3():
    return layers.fill_constant(shape=[3], dtype='int32', value=3)

main_program = fluid.default_main_program()
startup_program = fluid.default_startup_program()

with fluid.program_guard(main_program, startup_program):
    # Place x, y and z on the CPU
    with fluid.device_guard("cpu"):
        x = layers.fill_constant(shape=[1], dtype='float32', value=0.3)
        y = layers.fill_constant(shape=[1], dtype='float32', value=0.1)
        z = layers.fill_constant(shape=[1], dtype='float32', value=0.2)
    # Place pred_1, pred_2 and pred_3 on the CPU
    with fluid.device_guard("cpu"):
        pred_1 = layers.less_than(z, x)  # true: 0.2 < 0.3
        pred_2 = layers.less_than(x, y)  # false: 0.3 < 0.1
        pred_3 = layers.equal(x, y)      # false: 0.3 == 0.1

    # Run fn_1 if pred_1 is true
    out_1 = layers.case(
        pred_fn_pairs=[(pred_1, fn_1), (pred_2, fn_2)], default=fn_3)

    # default is None here; if none of the predicates holds, the last branch is used as the default
    out_2 = layers.case(pred_fn_pairs=[(pred_2, fn_2), (pred_3, fn_3)])

    exe = fluid.Executor(fluid.CUDAPlace(0))
    for i in range(100):
        if i == 20:  # skip the first 20 iterations as warm-up
            time_start = time.time()
        res_1, res_2 = exe.run(main_program, fetch_list=[out_1, out_2])
    time_end = time.time()
    print('time cost',time_end-time_start,'s')
    profiler.stop_profiler('total', '/tmp/profile')
    print(res_1)  # [[1. 1.]]
    print(res_2)  # [3 3 3]

Profiling the modified code gives the following result:

------------------------->     Profiling Report     <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
-------------------------     Overhead Summary      -------------------------
Total time: 58.7053
  Computation time       Total: 19.7797     Ratio: 33.6932%
  Framework overhead     Total: 38.9256     Ratio: 66.3068%
-------------------------     GpuMemCpy Summary     -------------------------
GpuMemcpy                Calls: 600         Total: 15.281      Ratio: 26.0301%
  GpuMemcpyAsync         Calls: 400         Total: 8.73338     Ratio: 14.8767%
  GpuMemcpySync          Calls: 200         Total: 6.54766     Ratio: 11.1534%
-------------------------       Event Summary       -------------------------
Event                                    Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::conditional_block               400         29.0445     28.488841 (0.980867)    0.555707 (0.019133)     0.002378    0.383915    0.0726114   0.494752
  fill_constant                          200         9.52572     9.245150 (0.970546)     0.280573 (0.029454)     0.034405    0.159726    0.0476286   0.162263
  assign                                 200         8.37987     8.104736 (0.967167)     0.275134 (0.032833)     0.031566    0.088268    0.0418994   0.142745
    GpuMemcpyAsync(same_gpu):GPU->GPU    200         4.94725     4.672113 (0.944386)     0.275134 (0.055614)     0.017507    0.05243     0.0247362   0.0842726
thread0::fetch                           200         8.15252     7.754766 (0.951210)     0.397758 (0.048790)     0.032766    0.075615    0.0407626   0.138872
  GpuMemcpySync:GPU->CPU                 200         6.54766     6.149903 (0.939252)     0.397758 (0.060748)     0.026847    0.052196    0.0327383   0.111534
thread0::fill_constant                   300         7.08439     7.084390 (1.000000)     0.000000 (0.000000)     0.010914    0.535107    0.0236146   0.120677
thread0::select_input                    200         5.03929     4.765534 (0.945675)     0.273757 (0.054325)     0.020717    0.053569    0.0251965   0.0858405
  GpuMemcpyAsync(same_gpu):GPU->GPU      200         3.78614     3.512380 (0.927695)     0.273757 (0.072305)     0.015393    0.048223    0.0189307   0.064494
thread0::less_than                       200         2.83091     2.830909 (1.000000)     0.000000 (0.000000)     0.009482    0.039851    0.0141545   0.0482224
thread0::logical_not                     200         2.73213     2.732127 (1.000000)     0.000000 (0.000000)     0.010661    0.028379    0.0136606   0.0465397
thread0::cast                            200         2.57408     2.574083 (1.000000)     0.000000 (0.000000)     0.010177    0.031029    0.0128704   0.0438476
thread0::equal                           100         1.24741     1.247408 (1.000000)     0.000000 (0.000000)     0.010344    0.026703    0.0124741   0.0212487

The Profiler report shows that the data traffic between CPU and GPU has been halved: the total number of GpuMemcpy calls drops from 1200 to 600.

-------------------------     GpuMemCpy Summary     -------------------------
GpuMemcpy                Calls: 600         Total: 15.281      Ratio: 26.0301%
  GpuMemcpyAsync         Calls: 400         Total: 8.73338     Ratio: 14.8767%
  GpuMemcpySync          Calls: 200         Total: 6.54766     Ratio: 11.1534%

PP then benchmarked the program and found the running time reduced by nearly 10%. Performance improved, and the optimization task was done.

These two examples show how much Profiler can help with model performance optimization, and PP came away with a much better feel for the work. Beyond what the examples show, Profiler offers analysis along more dimensions that anyone doing performance work should not miss. The rest of this article walks through Profiler's capabilities in detail, in the hope of inspiring and assisting your own optimization work.

Profiler's simple and easy-to-use API

PaddlePaddle provides two function interfaces, start_profiler and stop_profiler, for performance analysis. When analyzing model performance with Profiler, they can be used together to delimit the profiling range so that the analysis covers exactly the region of interest. Usage is shown below:

import paddle.fluid as fluid
import paddle.fluid.compiler as compiler
import paddle.fluid.profiler as profiler
data1 = fluid.layers.fill_constant(shape=[1, 3, 8, 8], value=0.5, dtype='float32')
data2 = fluid.layers.fill_constant(shape=[1, 3, 5, 5], value=0.5, dtype='float32')
shape = fluid.layers.shape(data2)
shape = fluid.layers.slice(shape, axes=[0], starts=[0], ends=[4])
out = fluid.layers.crop_tensor(data1, shape=shape)
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
compiled_prog = compiler.CompiledProgram(fluid.default_main_program())
for iter in range(10):
    if iter == 2:
        # Start profiling: record both CPU and GPU performance data; tracer_option is Default
        profiler.start_profiler(state='All', tracer_option='Default')
    if iter == 8:
        # Stop profiling: print the report sorted by total time and write a profile
        # file under /tmp for later timeline generation and analysis
        profiler.stop_profiler(sorted_key='total', profile_path='/tmp/profile')
    result = exe.run(program=compiled_prog, fetch_list=[out])

The code above runs the network 10 times, starting Profiler at iteration 2 and stopping it at iteration 8. The relevant parameters are as follows (a small variation is sketched right after this list):

  • state: specifies whether to record performance data for GPU computation, CPU computation, or both (All).

  • tracer_option: controls how performance data are recorded, i.e. whether to print overall information only or more detailed information.

  • sorted_key: specifies how the printed profiler results are sorted; we usually sort by total time (total) so as to focus on the main bottlenecks.

  • profile_path: the path where the profile result file is saved.
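As an illustration of these parameters, here is a small variation of the snippet above, assuming the same toy crop_tensor program: it records only GPU-side events and sorts the printed report by average time instead of total time (state='GPU' and sorted_key='ave' are just one possible combination, not a recommendation):

import paddle.fluid as fluid
import paddle.fluid.compiler as compiler
import paddle.fluid.profiler as profiler

# The same toy program as in the example above.
data1 = fluid.layers.fill_constant(shape=[1, 3, 8, 8], value=0.5, dtype='float32')
data2 = fluid.layers.fill_constant(shape=[1, 3, 5, 5], value=0.5, dtype='float32')
shape = fluid.layers.shape(data2)
shape = fluid.layers.slice(shape, axes=[0], starts=[0], ends=[4])
out = fluid.layers.crop_tensor(data1, shape=shape)

exe = fluid.Executor(fluid.CUDAPlace(0))
exe.run(fluid.default_startup_program())
compiled_prog = compiler.CompiledProgram(fluid.default_main_program())

# Record only GPU timings and sort the printed report by average time per call.
profiler.start_profiler(state='GPU')
for step in range(10):
    exe.run(program=compiled_prog, fetch_list=[out])
profiler.stop_profiler(sorted_key='ave', profile_path='/tmp/profile_gpu')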

Besides the flexible start_profiler / stop_profiler calls, PaddlePaddle also provides a general-purpose profiler for performance analysis: adding a profiler.profiler interface in a with statement is enough to profile everything that runs inside the with block. Example usage:

import paddle.fluid as fluid
import paddle.fluid.compiler as compiler
import paddle.fluid.profiler as profiler

data1 = fluid.layers.fill_constant(shape=[1, 3, 8, 8], value=0.5, dtype='float32')
data2 = fluid.layers.fill_constant(shape=[1, 3, 5, 5], value=0.5, dtype='float32')
shape = fluid.layers.shape(data2)
shape = fluid.layers.slice(shape, axes=[0], starts=[0], ends=[4])
out = fluid.layers.crop_tensor(data1, shape=shape)
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
compiled_prog = compiler.CompiledProgram(fluid.default_main_program())
# Profile everything executed inside the with block
with profiler.profiler('All', 'total') as prof:
    for i in range(10):
        result = exe.run(program=compiled_prog, fetch_list=[out])

Combining start_profiler and stop_profiler has the advantage of more flexible control over the profiling range, at the cost of a little extra code; the general profiler.profiler interface is more convenient to use but offers less control over the range. Pick whichever interface suits your needs when running an analysis.


Profiler's multi-dimensional performance analysis

The PaddlePaddle Profiler is a toolset that helps PaddlePaddle developers locate performance problems. Running an analysis through the Profiler interfaces produces a printed report like the following:

------------------------->     Profiling Report     <-------------------------
Note! This Report merge all thread info into one.
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Total time: 7.30941
  Computation time       Total: 0.225158    Ratio: 3.08039%
  Framework overhead     Total: 7.08425     Ratio: 96.9196%
-------------------------     GpuMemCpy Summary     -------------------------
GpuMemcpy                Calls: 2           Total: 0.74759     Ratio: 10.2278%
  GpuMemcpyAsync         Calls: 2           Total: 0.74759     Ratio: 10.2278%
-------------------------       Event Summary       -------------------------
Event                                                       Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
FastThreadedSSAGraphExecutorPrepare                         1           6.08951     6.076904 (0.997930)     0.012608 (0.002070)     6.08951     6.08951     6.08951     0.833106
GpuMemcpyAsync:CPU->GPU                                     1           0.703009    0.700705 (0.996723)     0.002304 (0.003277)     0.703009    0.703009    0.703009    0.0961787
mul                                                         1           0.219497    0.211977 (0.965740)     0.007520 (0.034260)     0.219497    0.219497    0.219497    0.0300294
elementwise_add                                             1           0.108017    0.105905 (0.980448)     0.002112 (0.019552)     0.108017    0.108017    0.108017    0.0147778
Fetch                                                       1           0.102637    0.100142 (0.975691)     0.002495 (0.024309)     0.102637    0.102637    0.102637    0.0140418
  GpuMemcpyAsync:GPU->CPU                                   1           0.044581    0.042086 (0.944034)     0.002495 (0.055966)     0.044581    0.044581    0.044581    0.00609913
mean                                                        1           0.064722    0.061234 (0.946108)     0.003488 (0.053892)     0.064722    0.064722    0.064722    0.00885462
InitLocalVars                                               1           0.006395    0.006395 (1.000000)     0.000000 (0.000000)     0.006395    0.006395    0.006395    0.0008749
eager_deletion                                              2           0.006166    0.006166 (1.000000)     0.000000 (0.000000)     0.002291    0.003875    0.003083    0.000843571
ScopeBufferedMonitor::post_local_exec_scopes_process        1           0.005985    0.005985 (1.000000)     0.000000 (0.000000)     0.005985    0.005985    0.005985    0.000818808
ScopeBufferedMonitor::pre_local_exec_scopes_process         1           0.003467    0.003467 (1.000000)     0.000000 (0.000000)     0.003467    0.003467    0.003467    0.00047432
------------------------->     Profiling Report     <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
-------------------------       Event Summary       -------------------------
Event                                                                Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread1::mul                                                         1           0.219497    0.211977 (0.965740)     0.007520 (0.034260)     0.219497    0.219497    0.219497    0.438084
thread1::elementwise_add                                             1           0.108017    0.105905 (0.980448)     0.002112 (0.019552)     0.108017    0.108017    0.108017    0.215586
thread1::Fetch                                                       1           0.102637    0.100142 (0.975691)     0.002495 (0.024309)     0.102637    0.102637    0.102637    0.204848
  GpuMemcpyAsync:GPU->CPU                                            1           0.044581    0.042086 (0.944034)     0.002495 (0.055966)     0.044581    0.044581    0.044581    0.0889771
thread1::mean                                                        1           0.064722    0.061234 (0.946108)     0.003488 (0.053892)     0.064722    0.064722    0.064722    0.129176
thread1::eager_deletion                                              2           0.006166    0.006166 (1.000000)     0.000000 (0.000000)     0.002291    0.003875    0.003083    0.0123064
thread0::FastThreadedSSAGraphExecutorPrepare                         1           6.08951     6.076904 (0.997930)     0.012608 (0.002070)     6.08951     6.08951     6.08951     0.894416
thread0::GpuMemcpyAsync:CPU->GPU                                     1           0.703009    0.700705 (0.996723)     0.002304 (0.003277)     0.703009    0.703009    0.703009    0.103257
thread0::InitLocalVars                                               1           0.006395    0.006395 (1.000000)     0.000000 (0.000000)     0.006395    0.006395    0.006395    0.000939285
thread0::ScopeBufferedMonitor::post_local_exec_scopes_process        1           0.005985    0.005985 (1.000000)     0.000000 (0.000000)     0.005985    0.005985    0.005985    0.000879065
thread0::ScopeBufferedMonitor::pre_local_exec_scopes_process         1           0.003467    0.003467 (1.000000)     0.000000 (0.000000)     0.003467    0.003467    0.003467    0.000509226

The report falls into two main parts: the framework performance analysis and the OP performance analysis.


Framework performance analysis

The framework analysis shows the framework overhead during training, meaning all time spent outside the OPs' compute functions. To better reflect the data traffic between GPU and CPU during training, it also reports Memcpy call counts; excessive memory traffic between CPU and GPU degrades performance, and with device_guard we can reduce that traffic and improve performance (a minimal device_guard sketch follows the sample output below).

------------------------->     Profiling Report     <-------------------------
Note! This Report merge all thread info into one.
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Total time: 7.30941
  Computation time       Total: 0.225158    Ratio: 3.08039%
  Framework overhead     Total: 7.08425     Ratio: 96.9196%
-------------------------     GpuMemCpy Summary     -------------------------
GpuMemcpy                Calls: 2           Total: 0.74759     Ratio: 10.2278%
  GpuMemcpyAsync         Calls: 2           Total: 0.74759     Ratio: 10.2278%
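As a compact illustration of that device_guard point, here is a sketch (not tied to any particular model) that pins a cheap scalar comparison to the CPU, so its result does not have to shuttle between host and device, while the heavier tensor work stays on the GPU:

import paddle.fluid as fluid
import paddle.fluid.layers as layers

main_program = fluid.default_main_program()
startup_program = fluid.default_startup_program()

with fluid.program_guard(main_program, startup_program):
    # Cheap scalar ops: keep them on the CPU to avoid extra GpuMemcpy calls.
    with fluid.device_guard("cpu"):
        limit = layers.fill_constant(shape=[1], dtype='int64', value=5)
        counter = layers.fill_constant(shape=[1], dtype='int64', value=0)
        cond = layers.less_than(counter, limit)

    # Heavier tensor work stays on the default device (the GPU here).
    data = layers.fill_constant(shape=[128, 128], dtype='float32', value=1.0)
    out = layers.reduce_sum(data)

exe = fluid.Executor(fluid.CUDAPlace(0))
res_out, res_cond = exe.run(main_program, fetch_list=[out, cond])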

Multi-dimensional OP performance analysis

The OP performance analysis has two parts: merged-thread results and per-thread (unmerged) results.

In the merged-thread results, for each OP we can see its call count across all threads, its total time (along with average, minimum and maximum), CPU time, GPU time, and time ratio.

-------------------------       Event Summary       -------------------------
Event                                                       Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
FastThreadedSSAGraphExecutorPrepare                         1           6.08951     6.076904 (0.997930)     0.012608 (0.002070)     6.08951     6.08951     6.08951     0.833106
GpuMemcpyAsync:CPU->GPU                                     1           0.703009    0.700705 (0.996723)     0.002304 (0.003277)     0.703009    0.703009    0.703009    0.0961787
mul                                                         1           0.219497    0.211977 (0.965740)     0.007520 (0.034260)     0.219497    0.219497    0.219497    0.0300294
elementwise_add                                             1           0.108017    0.105905 (0.980448)     0.002112 (0.019552)     0.108017    0.108017    0.108017    0.0147778
Fetch                                                       1           0.102637    0.100142 (0.975691)     0.002495 (0.024309)     0.102637    0.102637    0.102637    0.0140418
  GpuMemcpyAsync:GPU->CPU                                   1           0.044581    0.042086 (0.944034)     0.002495 (0.055966)     0.044581    0.044581    0.044581    0.00609913
mean                                                        1           0.064722    0.061234 (0.946108)     0.003488 (0.053892)     0.064722    0.064722    0.064722    0.00885462
InitLocalVars                                               1           0.006395    0.006395 (1.000000)     0.000000 (0.000000)     0.006395    0.006395    0.006395    0.0008749
eager_deletion                                              2           0.006166    0.006166 (1.000000)     0.000000 (0.000000)     0.002291    0.003875    0.003083    0.000843571
ScopeBufferedMonitor::post_local_exec_scopes_process        1           0.005985    0.005985 (1.000000)     0.000000 (0.000000)     0.005985    0.005985    0.005985    0.000818808
ScopeBufferedMonitor::pre_local_exec_scopes_process         1           0.003467    0.003467 (1.000000)     0.000000 (0.000000)     0.003467    0.003467    0.003467    0.00047432

The per-thread results look much the same, except that the performance data of different threads are reported separately.

-------------------------       Event Summary       -------------------------
Event                                                                Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread1::mul                                                         1           0.219497    0.211977 (0.965740)     0.007520 (0.034260)     0.219497    0.219497    0.219497    0.438084
thread1::elementwise_add                                             1           0.108017    0.105905 (0.980448)     0.002112 (0.019552)     0.108017    0.108017    0.108017    0.215586
thread1::Fetch                                                       1           0.102637    0.100142 (0.975691)     0.002495 (0.024309)     0.102637    0.102637    0.102637    0.204848
  GpuMemcpyAsync:GPU->CPU                                            1           0.044581    0.042086 (0.944034)     0.002495 (0.055966)     0.044581    0.044581    0.044581    0.0889771
thread1::mean                                                        1           0.064722    0.061234 (0.946108)     0.003488 (0.053892)     0.064722    0.064722    0.064722    0.129176
thread1::eager_deletion                                              2           0.006166    0.006166 (1.000000)     0.000000 (0.000000)     0.002291    0.003875    0.003083    0.0123064
thread0::FastThreadedSSAGraphExecutorPrepare                         1           6.08951     6.076904 (0.997930)     0.012608 (0.002070)     6.08951     6.08951     6.08951     0.894416
thread0::GpuMemcpyAsync:CPU->GPU                                     1           0.703009    0.700705 (0.996723)     0.002304 (0.003277)     0.703009    0.703009    0.703009    0.103257
thread0::InitLocalVars                                               1           0.006395    0.006395 (1.000000)     0.000000 (0.000000)     0.006395    0.006395    0.006395    0.000939285
thread0::ScopeBufferedMonitor::post_local_exec_scopes_process        1           0.005985    0.005985 (1.000000)     0.000000 (0.000000)     0.005985    0.005985    0.005985    0.000879065
thread0::ScopeBufferedMonitor::pre_local_exec_scopes_process         1           0.003467    0.003467 (1.000000)     0.000000 (0.000000)     0.003467    0.003467    0.003467    0.000509226

Profiler also offers several sort orders for the printed report. We usually sort by total time, since the entries that take the largest share of time are the natural entry point for optimization. In addition, Profiler preserves the hierarchical relationships between OPs during analysis and keeps that hierarchy in the printed report. This makes the OP performance information more intuitive; for instance, data-copy events inside an OP (such as GpuMemcpyAsync:GPU->CPU) show up clearly in the hierarchy, which makes performance analysis much more convenient.

In-depth OP performance analysis

As seen in the Profiler API above, there is a tracer_option parameter with three options: Default, OpDetail and AllOpDetail. Default prints only each OP's overall performance information, while OpDetail and AllOpDetail print finer-grained information.

  • The Default option's output is what was shown earlier in the hierarchical OP report: it gives a summary of each OP's time without breaking the OP's internal stages down.

  • The OpDetail option prints detailed performance data for each OP, split into infer_shape, prepare_data and compute times, which makes it easier to see where the time inside each OP goes and to target optimizations accordingly. infer_shape is the stage where the OP infers and checks shapes, prepare_data prepares the data needed for computation, and compute is the actual OP computation. The results look like this (a sketch of enabling this option follows the sample output):

-------------------------       Event Summary       -------------------------

Event                                                       Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
fill_constant                                               12          9.59502     9.577872 (0.998212)     0.017152 (0.001788)     0.078359    4.11928     0.799585    0.772524
  fill_constant/compute                                     12          9.29641     9.279253 (0.998155)     0.017152 (0.001845)     0.064842    4.06913     0.7747      0.748481
  fill_constant/infer_shape                                 12          0.075591    0.075591 (1.000000)     0.000000 (0.000000)     0.002567    0.01396     0.00629925  0.00608606
  fill_constant/prepare_data                                12          0.041932    0.041932 (1.000000)     0.000000 (0.000000)     0.001451    0.010575    0.00349433  0.00337607
slice                                                       6           1.40287     1.382675 (0.985607)     0.020192 (0.014393)     0.078351    0.83448     0.233811    0.112949
  slice/prepare_data                                        6           1.13939     1.127870 (0.989889)     0.011520 (0.010111)     0.045453    0.776968    0.189898    0.0917357
    GpuMemcpySync:CPU->GPU                                  6           0.92333     0.911810 (0.987523)     0.011520 (0.012477)     0.027762    0.731752    0.153888    0.07434
  slice/compute                                             6           0.185768    0.177096 (0.953318)     0.008672 (0.046682)     0.022966    0.041444    0.0309613   0.0149567
  slice/infer_shape                                         6           0.039823    0.039823 (1.000000)     0.000000 (0.000000)     0.004883    0.009277    0.00663717  0.00320627
crop_tensor                                                 6           0.586683    0.563035 (0.959692)     0.023648 (0.040308)     0.082797    0.13253     0.0977805   0.0472356
  crop_tensor/compute                                       6           0.501688    0.478040 (0.952863)     0.023648 (0.047137)     0.068311    0.117511    0.0836147   0.0403924
    GpuMemcpySync:GPU->CPU                                  6           0.230851    0.218243 (0.945385)     0.012608 (0.054615)     0.032632    0.055755    0.0384752   0.0185865
  crop_tensor/infer_shape                                   6           0.03331     0.033310 (1.000000)     0.000000 (0.000000)     0.003598    0.009579    0.00555167  0.00268189
  crop_tensor/prepare_data                                  6           0.009975    0.009975 (1.000000)     0.000000 (0.000000)     0.001146    0.003719    0.0016625   0.000803117
Fetch                                                       6           0.328085    0.316213 (0.963814)     0.011872 (0.036186)     0.047929    0.071537    0.0546808   0.0264151
  GpuMemcpyAsync:GPU->CPU                                   6           0.183378    0.171506 (0.935259)     0.011872 (0.064741)     0.026675    0.037943    0.030563    0.0147643
FastThreadedSSAGraphExecutorPrepare                         6           0.165158    0.165158 (1.000000)     0.000000 (0.000000)     0.016557    0.052761    0.0275263   0.0132974
shape                                                       6           0.160389    0.160389 (1.000000)     0.000000 (0.000000)     0.014121    0.055736    0.0267315   0.0129134
  shape/compute                                             6           0.08775     0.087750 (1.000000)     0.000000 (0.000000)     0.004395    0.03734     0.014625    0.00706501
  shape/infer_shape                                         6           0.023968    0.023968 (1.000000)     0.000000 (0.000000)     0.00276     0.00575     0.00399467  0.00192974
  shape/prepare_data                                        6           0.008976    0.008976 (1.000000)     0.000000 (0.000000)     0.001318    0.001797    0.001496    0.000722684
ScopeBufferedMonitor::post_local_exec_scopes_process        6           0.109541    0.109541 (1.000000)     0.000000 (0.000000)     0.012726    0.029744    0.0182568   0.00881947
eager_deletion                                              18          0.056515    0.056515 (1.000000)     0.000000 (0.000000)     0.00099     0.008829    0.00313972  0.00455019
ScopeBufferedMonitor::pre_local_exec_scopes_process         6           0.016096    0.016096 (1.000000)     0.000000 (0.000000)     0.001069    0.009635    0.00268267  0.00129594
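To reproduce a breakdown like the one above, a minimal sketch (again based on the earlier toy crop_tensor program) only needs to switch tracer_option when starting the profiler:

import paddle.fluid as fluid
import paddle.fluid.compiler as compiler
import paddle.fluid.profiler as profiler

data1 = fluid.layers.fill_constant(shape=[1, 3, 8, 8], value=0.5, dtype='float32')
data2 = fluid.layers.fill_constant(shape=[1, 3, 5, 5], value=0.5, dtype='float32')
shape = fluid.layers.shape(data2)
shape = fluid.layers.slice(shape, axes=[0], starts=[0], ends=[4])
out = fluid.layers.crop_tensor(data1, shape=shape)

exe = fluid.Executor(fluid.CUDAPlace(0))
exe.run(fluid.default_startup_program())
compiled_prog = compiler.CompiledProgram(fluid.default_main_program())

# 'OpDetail' additionally reports infer_shape / prepare_data / compute per OP type;
# 'AllOpDetail' (shown in the next example) further splits OPs by their position in the network.
profiler.start_profiler(state='All', tracer_option='OpDetail')
for step in range(6):
    exe.run(program=compiled_prog, fetch_list=[out])
profiler.stop_profiler(sorted_key='total', profile_path='/tmp/profile')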

The AllOpDetail option prints the running time of every individual OP in the network. In the results shown so far, all OPs of the same type are grouped together for analysis; compared with OpDetail, AllOpDetail reports the performance data of OPs at different positions in the network separately. For example, if a network contains two convolution layers, AllOpDetail prints their performance data separately as conv1 and conv2, allowing deeper analysis. The output looks like this:

-------------------------       Event Summary       -------------------------

Event                                                       Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
fill_constant                                               12          8.99499     8.977549 (0.998061)     0.017439 (0.001939)     0.059922    3.75175     0.749582    0.736424
  fill_constant0                                            6           4.53633     4.527403 (0.998032)     0.008927 (0.001968)     0.084302    3.68861     0.756055    0.371391
    fill_constant0/compute                                  6           4.33761     4.328688 (0.997942)     0.008927 (0.002058)     0.067428    3.60916     0.722936    0.355122
    fill_constant0/infer_shape                              6           0.050087    0.050087 (1.000000)     0.000000 (0.000000)     0.004186    0.018715    0.00834783  0.00410064
    fill_constant0/prepare_data                             6           0.019992    0.019992 (1.000000)     0.000000 (0.000000)     0.002385    0.00582     0.003332    0.00163675
  fill_constant1                                            6           4.36386     4.355344 (0.998049)     0.008512 (0.001951)     0.05556     3.74226     0.727309    0.357271
    fill_constant1/compute                                  6           4.20648     4.197966 (0.997976)     0.008512 (0.002024)     0.03893     3.68969     0.70108     0.344386
    fill_constant1/infer_shape                              6           0.043087    0.043087 (1.000000)     0.000000 (0.000000)     0.003811    0.017195    0.00718117  0.00352755
    fill_constant1/prepare_data                             6           0.016492    0.016492 (1.000000)     0.000000 (0.000000)     0.00199     0.004173    0.00274867  0.00135021
slice                                                       6           1.49876     1.478472 (0.986463)     0.020288 (0.013537)     0.09695     0.936687    0.249793    0.122704
  slice0                                                    6           1.47483     1.454542 (0.986244)     0.020288 (0.013756)     0.093403    0.932261    0.245805    0.120745
    slice0/prepare_data                                     6           1.16992     1.158303 (0.990071)     0.011616 (0.009929)     0.051469    0.86229     0.194986    0.0957818
      GpuMemcpySync:CPU->GPU                                6           0.993707    0.982091 (0.988310)     0.011616 (0.011690)     0.030551    0.813402    0.165618    0.0813552
    slice0/compute                                          6           0.183369    0.174697 (0.952707)     0.008672 (0.047293)     0.025079    0.043277    0.0305615   0.0150125
    slice0/infer_shape                                      6           0.053972    0.053972 (1.000000)     0.000000 (0.000000)     0.006931    0.01328     0.00899533  0.00441871
crop_tensor                                                 6           0.686333    0.662589 (0.965405)     0.023744 (0.034595)     0.092694    0.159363    0.114389    0.0561904
  crop_tensor0                                              6           0.661745    0.638001 (0.964119)     0.023744 (0.035881)     0.088815    0.154756    0.110291    0.0541773
    crop_tensor0/compute                                    6           0.498947    0.475203 (0.952412)     0.023744 (0.047588)     0.069971    0.10652     0.0831578   0.040849
      GpuMemcpySync:GPU->CPU                                6           0.22741     0.214642 (0.943855)     0.012768 (0.056145)     0.034765    0.043249    0.0379017   0.0186182
    crop_tensor0/infer_shape                                6           0.040319    0.040319 (1.000000)     0.000000 (0.000000)     0.005645    0.008416    0.00671983  0.00330093
    crop_tensor0/prepare_data                               6           0.017584    0.017584 (1.000000)     0.000000 (0.000000)     0.001817    0.008174    0.00293067  0.00143961
Fetch                                                       6           0.384527    0.371983 (0.967378)     0.012544 (0.032622)     0.052345    0.084889    0.0640878   0.0314814
  GpuMemcpyAsync:GPU->CPU                                   6           0.178042    0.165498 (0.929545)     0.012544 (0.070455)     0.027737    0.032732    0.0296737   0.0145764
shape                                                       6           0.228702    0.228702 (1.000000)     0.000000 (0.000000)     0.024993    0.064773    0.038117    0.0187239
  shape0                                                    6           0.203062    0.203062 (1.000000)     0.000000 (0.000000)     0.020906    0.059788    0.0338437   0.0166248
    shape0/compute                                          6           0.093917    0.093917 (1.000000)     0.000000 (0.000000)     0.00592     0.039347    0.0156528   0.00768903
    shape0/infer_shape                                      6           0.027612    0.027612 (1.000000)     0.000000 (0.000000)     0.003379    0.006415    0.004602    0.00226061
    shape0/prepare_data                                     6           0.021202    0.021202 (1.000000)     0.000000 (0.000000)     0.001856    0.01158     0.00353367  0.00173582
FastThreadedSSAGraphExecutorPrepare                         6           0.173718    0.173718 (1.000000)     0.000000 (0.000000)     0.01823     0.057948    0.028953    0.0142224
ScopeBufferedMonitor::post_local_exec_scopes_process        6           0.136108    0.136108 (1.000000)     0.000000 (0.000000)     0.014345    0.041874    0.0226847   0.0111432
eager_deletion                                              18          0.092351    0.092351 (1.000000)     0.000000 (0.000000)     0.002197    0.010871    0.00513061  0.00756082
ScopeBufferedMonitor::pre_local_exec_scopes_process         6           0.018935    0.018935 (1.000000)     0.000000 (0.000000)     0.002004    0.00793     0.00315583  0.00155022


Summary

The PaddlePaddle core framework has solid support for low-level libraries such as cuDNN and MKL-DNN; combined with careful implementation work, this gives PaddlePaddle OPs excellent performance overall, and the framework's architecture keeps execution efficient as well. Still, in the occasional case where training performance falls short of expectations, the Profiler tool can be used to analyze performance and help developers find and fix performance problems promptly, both during model development and during framework development. And if you do discover OP performance or framework overhead issues while using it, you are welcome to discuss them with us via GitHub issues or contribute your own PRs, and help the PaddlePaddle deep learning framework fly on even more hardware.

Related reading:

Training too slow? Running out of GPU memory? This trick teams up your GPU with your CPU

If you run into problems while using it, you can join the official PaddlePaddle QQ group: 1108045677.

For more details about PaddlePaddle, please refer to the following resources.

Official website:

https://www.paddlepaddle.org.cn

PaddlePaddle open-source framework repositories:

GitHub: 

https://github.com/PaddlePaddle/Paddle

Gitee:  

https://gitee.com/paddlepaddle/Paddle


Reposted from blog.csdn.net/weixin_45449540/article/details/107328580