MegEngine usage tips: Profiler manual

0. Preface

"xx, R's response to multi-machine training is slow, please take a look at what's going on."

"xxx, why is the network training of xxx slowed down after the MGE update? Please take a look."

These are everyday conversations in our group,

and every day someone ends up taking the blame and investigating.

So the team members' daily status is: improving performance, improving performance, and then, damn it, improving performance again.

According to incomplete statistics, 80% of performance problems are actually caused by training code that is not written well enough for MGE to run it efficiently.

Including but not limited to the following situations:

1) fast_run is not enabled

2) Frequent numpy calls that force synchronization

3) make_allreduce_cb is not used, so computation and communication run serially

4) ...

After debugging these issues many times, I found that it takes too much time and the steps are the same every time, so why keep repeating them? Hence this article, to summarize the process for everyone's convenience (and my own).

1. Profiler introduction

First, we need to understand what the Profiler is.

Simply put, the Profiler records the running time of all operators in the form of a timeline.

From the profiling results, we can quickly find out why a piece of code runs slowly:

Is it doing extra work? Is it wasting computing resources? Or is the operator itself slow and needs to be replaced with another operator?

1.png

The figure above shows a simple profiling result.

In most cases we only need to focus on the gpu threads. Each gpu thread corresponds to a cuda stream, and the entries on it are the operators running on that cuda stream.

2. How to use

PS: The statistics collected in static graph mode are not complete (they are affected by graph optimization), and the profiling results are less friendly than those of dynamic graph mode.

import megengine.functional as F
from megengine.utils.profiler import profile, Profiler

# decorator style
@profile()
def train_step(data, label, *, optimizer, gm, model):
    with gm:
        logits = model(data)
        loss = F.loss.cross_entropy(logits, label)
        gm.backward(loss)
        optimizer.step().clear_grad()
    return loss

# "with" style
# It is best to have only one Profiler instance during training: the profiler
# automatically dumps its results when it is destructed, so with multiple
# instances every iter would dump, which is very slow.
profiler = Profiler()

def train_step(data, label, *, optimizer, gm, model):
    with profiler:
        with gm:
            logits = model(data)
            loss = F.loss.cross_entropy(logits, label)
            gm.backward(loss)
            optimizer.step().clear_grad()
    return loss

⚠️ Note: by default the profiler exports the profiling results when it is destructed. You can also call profiler.dump() to dump the results manually.
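
For example, continuing the "with" style above, you can dump the results explicitly at the end of training instead of relying on destruction (the loop variables below, such as num_steps, data, and label, are hypothetical placeholders):

for step in range(num_steps):      # num_steps / data / label are assumed to exist elsewhere
    loss = train_step(data, label, optimizer=optimizer, gm=gm, model=model)

profiler.dump()                    # explicitly write out the collected results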

Parameter Description:

The Profiler constructor supports the following parameters (a usage sketch follows the list):

  • path: the storage path for the profile data; defaults to the profile folder under the current path.
  • format: the output data format. The default is chrome_timeline.json, a standard format supported by Chrome that displays the profiling results as a timeline. Another option is memory_flow.svg, which displays memory usage over time x address space.
  • formats: if more than one output format is required, list them in the formats parameter.
  • sample_rate: if non-zero, GPU memory statistics are recorded every n ops, so a memory usage curve can be drawn when analyzing the data. The default is 0.
  • profile_device: whether to record gpu time, the default is True.
  • with_scopes: whether to additionally record the scopes corresponding to python APIs such as functional / tensor methods, the default is False.
  • with_backtrace: whether to record the python call stack corresponding to ops/events, the default is False. Turning it on will increase the size of the recorded data file.
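
As a quick reference, here is a sketch of a Profiler constructed with several of these options (the values are examples only):

from megengine.utils.profiler import Profiler

profiler = Profiler(
    path="./profile",                # output directory for the profile data
    format="chrome_timeline.json",   # or "memory_flow.svg"; use formats=[...] for several
    sample_rate=10,                  # record GPU memory statistics every 10 ops (0 = off)
    profile_device=True,             # also record gpu time
    with_scopes=False,               # additionally record functional / tensor method scopes
    with_backtrace=False,            # record python call stacks (larger output file)
)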

Scope usage introduction

MegEngine automatically adds scopes to a module's forward, backward, and step stages. These scopes are displayed on the host thread, so you can see which module an op belongs to and in which stage it runs.

2.png

You can also add scopes yourself:

from megengine import Tensor
from megengine.utils.profiler import Profiler, scope

def main():
    with Profiler():
        x = Tensor(1)
        with scope("Add"):
            y = x + 1
        with scope("Mul"):
            z = x * 3

By default, the profiler only records the scopes of module forward, backward, and step. Users can pass with_scopes=True when constructing the Profiler object to additionally record the scopes corresponding to functional / tensor method API calls.
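
For example, a minimal sketch (the BatchNorm2d module and the input shape here are just illustrative choices):

import megengine.functional as F
import megengine.module as M
from megengine.utils.profiler import Profiler

net = M.BatchNorm2d(16)                 # illustrative module; any module works
profiler = Profiler(with_scopes=True)   # additionally record functional / tensor method scopes

with profiler:
    y = net(F.zeros((2, 16, 8, 8)))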

After the with_scopes option is turned on, the functional / tensor method API scopes called inside the BatchNorm2d module are additionally recorded.

3.png

After turning on the with_scopes option, an additional backward scope is also recorded. This scope marks which forward operator a backward computation sequence corresponds to, and it can be used to find the forward operator that produced a slow operator in the backward pass. The scope in the figure below indicates that Broadcast, SetSubtensor, and other operators were generated by the backward computation of Subtensor.

4.png

3. Visual display

It is recommended to use perfetto to view the profiling results. You can also open the timeline file with the Performance panel of Chrome developer tools (F12), or view it with chrome://tracing/.

The following describes how to use it, taking perfetto as the example.

1) Statistics

You can select a continuous time range to view the statistics for that range.

The event statistics are displayed at the bottom. You can see the actual time occupied by each event (Wall duration); idle time can be derived from the total time. Results can be sorted by total occupied time or by average time.

5.png

2) Dependencies

On the host thread, each op records its inputs, outputs, and the corresponding dependencies. You can follow the arrows to find the earlier op that an input depends on, or click the flow events at the bottom to jump to the previous or next op.

6.png

7.png

We can also find the host time and gpu time corresponding to an op: click on the op to see the time it occupies on the different threads (cpu, gpu).

8.png

3) Viewing metrics such as GPU memory usage and gpu utilization

In addition to recording operator execution time, the profiler also records some metrics related to GPU memory and performance. gpu_usage_ratio records the average gpu utilization during program execution (the proportion of total time the gpu spends executing kernels); a low gpu_usage_ratio indicates that the host side of the program is the bottleneck. gpux:x alloc_mem_value records the curve of gpux memory usage over time; this requires sample_rate to be set to an integer greater than 0 (sample_rate means GPU memory usage is recorded every n ops).

9.png

4) Zoom in and out

You can drag the start and end points of the timeline at the top to adjust the visible range, or zoom in and out with zoom gestures.

The small gray square above the middle vertical line is the handle that can be dragged.

10.png 11.png

4. Common debugging techniques (with usage examples)

1) Redundant calculations

12.png

In the yolox example, the forward, backward, and step operations have already finished, but they are followed by a large number of reshape operations (reshape is generally considered to involve no actual computation, so this is basically wasted time).

After finding and fixing the cause, the results are as follows (5s -> 1.3s):

13.png

2) Computation and communication running serially (make sure you use make_allreduce_cb)

Allreduce communication should run on gpu0:1. If you find the communication on gpu0:0, you are using it incorrectly.

14.png
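
For reference, a sketch of how gradients are usually attached with an allreduce callback so that communication runs on its own cuda stream and overlaps with the backward computation (based on MegEngine's distributed training examples; model is assumed to be defined elsewhere):

import megengine.distributed as dist
from megengine.autodiff import GradManager

gm = GradManager()
if dist.get_world_size() > 1:
    # register an allreduce callback so gradient communication overlaps with backward
    gm.attach(model.parameters(), callbacks=[dist.make_allreduce_cb("SUM", dist.WORLD)])
else:
    gm.attach(model.parameters())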

3) Host performance is slow and gpu utilization is low

15.png

The cpu time and gpu time are basically the same, which is very suspicious.

Zoom in and look carefully: there are many gaps in the gpu execution. Click the corresponding op to view its dependencies, and you can see that the gaps in between are spent waiting for the host to launch cuda kernels.

16.png

4) Using backtrace recording to find the source code behind a performance bottleneck

The examples above show how to find abnormal parts in the profiling results. The profiler also provides a backtrace (call stack) recording feature to help users locate the training source code corresponding to those abnormal parts. Backtrace recording captures the python call stack for operator dispatch / kernel execution and for events such as TensorWaitProperty.

You can turn on call stack recording by passing with_backtrace=True when constructing the Profiler object. Turning on this option increases the size of the profiler's saved data file.

Users can click an op in the perfetto UI to view its corresponding source code.

In the profiling results in the figure below, the CompiledOp[IOU] operator takes a long time to execute. From the recorded backtrace, we can see that this operator is called in the loss calculation part of the detection model.

17.png

In the figure below, a certain TensorWaitProp in the interpreter thread takes a long time, which may slow down the host execution speed and cause the GPU to become idle.

(TensorWaitProp may be generated by calls such as tensor.shape and tensor.numpy(), which make the host side wait for device execution in order to obtain the Tensor's value or shape attribute.)
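
As a hypothetical illustration of the pattern described above (not the basedet code itself):

import megengine.functional as F

x = F.zeros((4, 16))

n = x.shape[0]   # reading .shape may force the host to wait for device execution
v = x.numpy()    # .numpy() syncs: the host waits for the device to produce the value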

Through the call stack, we can find that this event is generated by a getitem in the get_ground_truth method of the basedet detection model (the tensor shape attribute used in __getitem__ triggers the host-side sync).

18.png

Appendix

For more MegEngine information, you can view the documentation and GitHub project, or join the MegEngine user communication QQ group: 1029741705. You are welcome to contribute to the MegEngine community, become an Awesome MegEngineer, and enjoy endless certificates of honor and customized gifts.
