How OneFlow does the operator alignment task for static graphs

Article directory

  • 1 Introduction
  • 2 Overview of OneFlow's Graph Operator Alignment
  • 3 Implementation principle of automatic test in Graph mode
    • 3.1 Introduction to AutoTest Process
    • 3.2 How Graph mode performs operator alignment alongside Eager mode
    • 3.3 Graph-Specific Aspects of Automated Testing
  • 4 Debug support for Graph
  • 5 Summary
  • 6 Related Links

1 Introduction

There are two main ways of running models in a deep learning framework: dynamic graphs and static graphs. Dynamic graphs are easier to use, while static graphs have better performance. OneFlow calls them Eager mode and Graph mode respectively. OneFlow provides the nn.Graph module, which lets users build static-graph training and testing code with programming habits similar to Eager mode. It is therefore necessary to ensure that operator behavior and results are correct in both Eager and Graph modes.

A previous article has discussed how a deep learning framework can elegantly perform operator alignment, and analyzed the automatic testing process for Eager Ops, including how random test cases are generated and how the AutoTest core code is implemented. The AutoTest framework can be easily ported to other deep learning frameworks. However, that article did not cover the automatic testing of Graph Ops, so the main purpose of this article is to introduce how OneFlow tests operators in Graph mode. As of OneFlow v0.7.0, unit-test support for static (nn.Graph) execution has been added for all Ops, and the automated unit-testing functionality is complete.

The code location involved in the article:

  • https://github.com/Oneflow-Inc/oneflow/blob/master/python/oneflow/test_utils/automated_test_util/torch_flow_dual_object.py
  • https://github.com/Oneflow-Inc/oneflow/blob/master/python/oneflow/test_utils/automated_test_util/generators.py

2 Overview of OneFlow's Graph Operator Alignment

The Eager mode provided by OneFlow is aligned with PyTorch in usage. Therefore, during testing, the AutoTest framework randomly generates Ops with various legal parameters and runs the PyTorch code and the OneFlow code respectively on input Tensors with the same values and types (one copy for PyTorch, one for OneFlow) to complete the operator alignment work.
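
To make the idea concrete, here is a minimal hand-written sketch of such an alignment check for matmul (it only illustrates the principle, not the AutoTest framework itself, and the tolerances are made up): both frameworks receive the same NumPy data and the outputs are compared numerically.

    import numpy as np
    import torch
    import oneflow as flow

    # Same input values and dtype for both frameworks.
    np_x = np.random.randn(4, 5).astype(np.float32)
    np_y = np.random.randn(5, 3).astype(np.float32)

    torch_res = torch.matmul(torch.tensor(np_x), torch.tensor(np_y))
    flow_res = flow.matmul(flow.tensor(np_x), flow.tensor(np_y))

    # The operator is "aligned" if the two results agree within tolerance.
    assert np.allclose(torch_res.numpy(), flow_res.numpy(), rtol=1e-4, atol=1e-5)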

In addition, OneFlow also provides the Graph mode, which is likewise based on an object-oriented programming style, so that users familiar with Eager development can use static graphs efficiently with only a few code changes. Compared with Eager mode, Graph mode is harder to debug, but it has better performance and is easier to optimize and deploy. How to automatically test Ops in Graph mode is therefore the key issue to address.
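
As a reminder of what those few code changes look like, the following is a minimal sketch (the module and tensor shapes are chosen arbitrarily for illustration) of running the same nn.Module in Eager mode and then wrapping it in nn.Graph:

    import oneflow as flow
    import oneflow.nn as nn

    model = nn.Linear(4, 3)
    x = flow.randn(2, 4)

    # Eager mode: call the module directly.
    eager_out = model(x)

    # Graph mode: wrap the same module in nn.Graph and describe the
    # computation in build(); the first call compiles the static graph.
    class MyGraph(nn.Graph):
        def __init__(self):
            super().__init__()
            self.model = model

        def build(self, x):
            return self.model(x)

    graph_out = MyGraph()(x)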

Before going into the details of the Graph unit tests, let's look at how Graph testing is enabled in the AutoTest framework. The following example tests the matmul operator. Two random tensors are constructed with the random_tensor method, with shapes [n, k] and [k, m] respectively, where the dimension values are generated randomly. The randomness of the AutoTest framework's parameters is based on classes derived from the generator base class.

    @autotest(check_graph=True)
    def test_flow_matmul_with_random_data(test_case):
        device = random_device()
        k = random(1, 6)
        x = random_tensor(ndim=2, dim1=k).to(device)
        y = random_tensor(ndim=2, dim0=k).to(device)
        z = torch.matmul(x, y)
        return z

By calling torch.matmul(x, y), the automatic test framework runs the matmul operators of PyTorch and OneFlow respectively and checks whether the forward and backward results of the OneFlow and PyTorch operators are consistent in Eager mode. It is worth noting that the check_graph switch of the @autotest decorator is set to True, which means the Graph unit test will also be run alongside the Eager test.

3 Implementation principle of automatic test in Graph mode

After understanding the background and usage, let's explain the implementation idea of Graph AutoTest.

3.1 Introduction to AutoTest Process

How random data is generated and how the autotest() decorator is implemented are not the focus of this article; they were introduced clearly in the previous article on the Eager automatic testing principle. Regarding the core flow of the AutoTest framework, the first thing to pay attention to is the GetDualObject function, which is used in the operator alignment between OneFlow and PyTorch. GetDualObject overrides the __call__ magic method of the original PyTorch and OneFlow objects passed in, and finally returns a DualObject object. This process also includes skipping some magic functions that need no attention; checking whether the attributes of the incoming objects are legal; and, based on the type of nn.Module and the default parameters of other APIs, binding the random data generated by the classes derived from generator to specific types (this work is completed in the get_args function). In addition, tensor methods are treated specially, because tensor methods are called via getattr, while nn.Module and nn.functional are called via __call__.
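
To convey the idea, the following is a greatly simplified sketch of what a DualObject looks like; it is not the real implementation, which additionally handles keyword arguments, attribute checks, nn.Module parameters, and many other cases.

    class DualObject:
        """Simplified sketch: hold a PyTorch object and its OneFlow counterpart
        together, and forward attribute access and calls to both sides."""

        def __init__(self, pytorch, oneflow):
            self.pytorch = pytorch
            self.oneflow = oneflow

        def __getattr__(self, name):
            # e.g. dual_torch.matmul -> a DualObject wrapping both matmul functions.
            return DualObject(getattr(self.pytorch, name), getattr(self.oneflow, name))

        def __call__(self, *args):
            # Unwrap DualObject arguments for each side, call both, and wrap the results.
            torch_args = [a.pytorch if isinstance(a, DualObject) else a for a in args]
            flow_args = [a.oneflow if isinstance(a, DualObject) else a for a in args]
            return DualObject(self.pytorch(*torch_args), self.oneflow(*flow_args))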

Based on the above process, when the sample code executes torch.matmul(x, y), the AutoTest framework generates a DualObject object by calling GetDualObject; here torch itself can be understood as a DualObject object. Finally, the DualObject object is executed to compare the results. More details of the Eager operator alignment in the automatic test process are also clearly introduced in the previous article.

  • The GetDualObject function is implemented at: https://github.com/Oneflow-Inc/oneflow/blob/7fe29cb3d24be41fa981c4ad6be3051dacc3b605/python/oneflow/test_utils/automated_test_util/torch_flow_dual_object.py#L600

  • The DualObject class is implemented at: https://github.com/Oneflow-Inc/oneflow/blob/0826518cc49200dccada0f54d5c83accb9218c83/python/oneflow/test_utils/automated_test_util/torch_flow_dual_object.py#L784

3.2 How Graph mode performs operator alignment alongside Eager mode

From the above analysis, the AutoTest process can be roughly summarized as: generate random data, generate DualObject objects, execute the DualObject objects, and judge whether the results are aligned. During the execution phase of the DualObject objects, the AutoTest framework additionally runs the Graph version of the OneFlow operator, thus aligning the Graph-mode operator with the Eager mode. This section also explains how objects that need static (Graph) execution are identified in the GetDualObject function.

In the operator alignment task there are three categories: nn.Module, nn.functional, and tensor methods. Here we take the nn.Module type as an example to analyze the code that tests Graph mode alongside Eager mode; the other two categories are handled in basically the same way. The code execution order is shown below.
(Figure: code execution order of the Graph test for the nn.Module case)

In oneflow_eager_run_with_graph_check, get_module_graph_test and get_oneflow_eager_res are called respectively to obtain the Graph-mode and Eager-mode results, which are then checked for alignment. In other words, for one test case the AutoTest framework executes three pieces of code in total, namely PyTorch, OneFlow Eager mode, and OneFlow Graph mode, and verifies that the three results are aligned. Let's first look at get_module_graph_test, i.e. how the Graph-version result is obtained. The code is as follows:

# NOTE(lixiang): When oneflow is of type nn.Module, build the following Graph for testing.
#   graph_train_oneflow: is a deepcopy of oneflow.
def get_module_graph_test(graph_train_oneflow, oneflow, *args):
    of_sgd = flow.optim.SGD(graph_train_oneflow.parameters(), lr=0.001, momentum=0.9,)
    graph_train_parameters_len = 0
    for param in oneflow._parameters.values():
        if param is not None:
            graph_train_parameters_len += 1

    class TestGraphOfModule(flow.nn.Graph):
        def __init__(self):
            super().__init__()
            self.test_module = graph_train_oneflow
            if global_backward and graph_train_parameters_len:
                self.add_optimizer(of_sgd)

        def build(self, *args):
            res = self.test_module(*args)
            forward_res = res
            if global_backward and graph_train_parameters_len:
                res = res.sum()
                res.backward()
            return forward_res

    return TestGraphOfModule()

Here oneflow is an nn.Module object and graph_train_oneflow is its deep copy; the copy mainly prevents the original Eager object from being modified (for example by inplace operations), which would make the inputs of Graph and Eager inconsistent and lead to inconsistent test results. First, an SGD optimizer is constructed in order to verify that Graph's backward pass executes normally. In __init__, the Eager-mode nn.Module object is reused; the computation of the Graph test is then described in build, and finally an instance of the Graph is returned. In short, this constructs a generic static-graph model that adapts to all operators.

Having discussed how the statically executed code that computes the Graph result is constructed, identifying which objects require static execution is the next problem to solve. The complete code of oneflow_eager_run_with_graph_check is as follows:

# NOTE(lixiang): Check if the results of eager and graph are equal when oneflow is of type nn.Module or functional.
def oneflow_eager_run_with_graph_check(
    oneflow, oneflow_args, oneflow_kwargs, testing_graph, verbose, *args
):
    if testing_graph:
        graph_args, graph_kwargs = get_args_copy(oneflow_args, oneflow_kwargs)

        if isinstance(oneflow, flow.nn.Module):
            graph_train_oneflow = copy.deepcopy(oneflow)
            if not is_global():
                arg_device_type = "cpu"
                for arg in oneflow_args:
                    if flow.is_tensor(arg):
                        arg_device_type = arg.device.type
                graph_train_oneflow = graph_train_oneflow.to(arg_device_type)

        else:
            graph_functional_oneflow = copy.deepcopy(oneflow)

    oneflow_res = get_oneflow_eager_res(oneflow, oneflow_args, oneflow_kwargs, verbose)
    if testing_graph:
        if verbose:
            print(
                "After running eager module or functional: ", repr(oneflow),
            )
        find_check_module_func = True
        ignore_apis_list = ["tensor", "train"]
        test_g_res = []
        if isinstance(oneflow, flow.nn.Module):
            test_g = get_module_graph_test(graph_train_oneflow, oneflow, *args)
            if verbose:
                print("Run graph of module: ", repr(oneflow))
                test_g.debug(3)
            # When testing module methods, kwargs are not considered.
            test_g_res = test_g(*graph_args)
            if verbose:
                print(
                    "The result after running graph module: ", test_g_res,
                )
        elif oneflow.__name__ in ignore_apis_list:
            find_check_module_func = False
        # 1. "oneflow.nn.modules" not in oneflow.__module__: For avoid run nn.Module branch graph test, like fold op call Fold Module actually.
        # 2. inspect.isfunction(oneflow): Compared with the ordinary flow.xxx, oneflow.nn.modules.math_ops series op exist an extra layer of python wrapper.
        # 3. inspect.ismethod(oneflow) and "oneflow.nn.modules" in oneflow.__module__:  For op that only has Tensor.xxx method, and call oneflow.xxx actually, like masked_fill.
        elif (
            ("oneflow.nn.modules" not in oneflow.__module__)
            or inspect.isfunction(oneflow)
            or (
                inspect.ismethod(oneflow) and "oneflow.nn.modules" in oneflow.__module__
            )
        ):

            test_g_res = get_functional_graph_res(
                graph_functional_oneflow,
                oneflow,
                oneflow_res,
                oneflow_args,
                oneflow_kwargs,
                verbose,
                *graph_args,
                **graph_kwargs,
            )
        if find_check_module_func:
            if isinstance(test_g_res, tuple):
                for _, g_res in enumerate(test_g_res):
                    check_eager_graph_tensor(oneflow_res, g_res)
            else:
                check_eager_graph_tensor(oneflow_res, test_g_res)
    return oneflow_res

In oneflow_eager_run_with_graph_check, we need to decide which objects should be tested statically, because only part of OneFlow's code is meant to run as a static graph; for example, some methods available in Eager mode are not defined in Graph mode. In the code above, if testing_graph: first judges whether the Graph switch is turned on, i.e. whether the Graph unit test should be run alongside the Eager test. Then isinstance is used to judge the type of the oneflow object: when it is an nn.Module, static execution is needed and get_module_graph_test is called; otherwise get_functional_graph_res (or other handling) is called. The same pattern applies to other places in the test framework that require similar judgments, as the skeleton below shows.

    if testing_graph:
        ···
        ···
        if isinstance(oneflow, flow.nn.Module):
            ···
            test_g = get_module_graph_test(graph_train_oneflow, oneflow, *args)
            ···
        elif ···:
            ···
            ···

3.3 Graph-Specific Aspects of Automated Testing

Having introduced in section 3.2 how Graph mode performs operator alignment alongside Eager mode, this section analyzes the parts of the automated tests that are specific to Graph mode.

In Graph mode, three categories of methods need to be handled: nn.Module, nn.functional, and tensor methods, and the AutoTest framework first judges the category and then builds the graph. Starting from the GetDualObject function, the related interfaces include: get_pytorch_oneflow_res, get_pytorch_oneflow_tensor_res, oneflow_eager_run_with_graph_check, oneflow_tensor_eager_run_with_graph_check, get_oneflow_eager_res, get_tensor_graph_res, get_functional_graph_res, and get_module_graph_test. Their roles are summarized in the following table.

| Function | Info |
| --- | --- |
| get_module_graph_test | Returns the Graph instance of an operator in the nn.Module category. |
| get_functional_graph_res | Returns the Graph computation result of an operator in the nn.functional category. |
| get_tensor_graph_res | Returns the Graph computation result of an operator in the tensor-method category. |
| get_pytorch_oneflow_res | Obtains the computation results of PyTorch and OneFlow respectively. |
| get_pytorch_oneflow_tensor_res | The tensor-method specialization of the above. |
| oneflow_eager_run_with_graph_check | Tests whether the Eager-mode results are aligned, accompanied by a Graph check. |
| oneflow_tensor_eager_run_with_graph_check | The tensor-method specialization of the above. |
| get_oneflow_eager_res | Obtains the computation result of the OneFlow Eager-mode operator. |

After roughly understanding the role of each function, let's take a look at the call chains for nn.Module, nn.functional, and tensor methods.
(Figure: call chains of the Graph automated-test interfaces)

Beyond the handling of nn.Module, nn.functional, and tensor methods, the automatic Graph test also includes a backward (gradient) test, but the gradients of the tensors are not extracted; that is, the test guarantees that backward executes normally, while the grad values of the tensors are not checked. In terms of usage, when auto_backward=True is set in @autotest() (it is on by default), the framework not only runs the Eager backward test (where the gradient results are compared), but also runs the corresponding Graph backward test (where no gradient comparison is done).

The code corresponding to the above description can be found in the snippet from section 3.2:

if (
  global_backward
  and graph_train_parameters_len
):
  self.add_optimizer(of_sgd)
···
···
···
if (
  global_backward
  and graph_train_parameters_len
):
  res = res.sum()
  res.backward()

In addition, for the Graph checks of the inplace versions of some operators, the inputs must be deep-copied to ensure that the inputs of Graph and Eager always stay consistent. In the code below, get_args_copy (in torch_flow_dual_object.py) deep-copies the positional arguments and keyword arguments respectively. Similarly, in the Graph unit test, oneflow is deep-copied as graph_train_oneflow, mainly to prevent Eager inplace operations from modifying the value when testing some operators, which would make the inputs of Graph and Eager inconsistent and lead to test errors.

# NOTE(lixiang): Deepcopy the input parameters in order to correctly test the inplace version of the op.
def get_args_copy(args, kwargs):
    copy_args = []
    for arg in args:
        if flow.is_tensor(arg):
            copy_arg = arg.clone().detach()
        else:
            copy_arg = copy.deepcopy(arg)
        copy_args.append(copy_arg)
    copy_kwargs = {}
    for key, value in kwargs.items():
        if flow.is_tensor(value):
            copy_kwargs[key] = value.clone().detach()
        else:
            copy_kwargs[key] = copy.deepcopy(value)
    return copy_args, copy_kwargs
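
A small illustration of why this copy matters (hypothetical values, using the get_args_copy above): an inplace update performed during the Eager run no longer touches the tensor that the Graph run will consume.

    import oneflow as flow

    x = flow.ones(2, 3)
    graph_args, _ = get_args_copy([x], {})

    x.add_(1)  # an inplace op executed during the Eager test

    print(x[0, 0].item())              # 2.0: the original tensor was modified
    print(graph_args[0][0, 0].item())  # 1.0: the cloned Graph input is unaffected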

Finally, to ensure the correctness of tensor deepcopy, in OneFlow the __getstate__ and __setstate__ hooks used by copy.deepcopy must include the tensor's data, dtype, and device information at the same time; all three are indispensable. See the specific code: https://github.com/Oneflow-Inc/oneflow/blob/e00ba51364ff87e39edc409be395e5ed493a4ac0/python/oneflow/framework/check_point_v2.py#L159
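
The following toy class (purely illustrative, not OneFlow's actual implementation) shows why all three pieces of state must travel together through copy.deepcopy: if any of them were dropped from __getstate__, the copy would silently come back with the wrong dtype or on the wrong device.

    import copy

    class MiniTensorState:
        """Toy stand-in for a tensor whose deepcopy must keep data, dtype and device."""

        def __init__(self, data, dtype, device):
            self.data, self.dtype, self.device = data, dtype, device

        def __getstate__(self):
            # All three fields are indispensable for a faithful copy.
            return {"data": self.data, "dtype": self.dtype, "device": self.device}

        def __setstate__(self, state):
            self.data = state["data"]
            self.dtype = state["dtype"]
            self.device = state["device"]

    t = MiniTensorState([1.0, 2.0], "float32", "cuda:0")
    t_copy = copy.deepcopy(t)
    assert t_copy.dtype == t.dtype and t_copy.device == t.device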

4 Debug support for Graph

In the code in section 3.2, you can find the if verbose: judgment. When verbose is True, the debug information of the Graph (such as the computation result after running the operator in Graph mode) is output, along with other required debugging information under Eager. When a test fails, this can be used to obtain the failing sample and construct a minimal reproduction. It is enabled via an environment variable: ONEFLOW_TEST_VERBOSE=1. This feature of the AutoTest framework is aimed more at developers; OneFlow's Graph also provides debugging features for users.

Graph mode supports debug output of the learning rate, enabled in the same way as in Eager mode.

optimizer = flow.optim.SGD(model.parameters(), lr=1e-3)
# Set verbose=True
scheduler = flow.optim.lr_scheduler.CosineDecayLR(optimizer, decay_steps=100, alpha=0.98, verbose=True)

In addition, calling the debug method of the Graph object enables the debug mode of the Graph.

graph.debug(v_level=1)  # can be abbreviated as graph.debug(1)
  • v_level=0, only the most basic warnings and graph-building stage information, such as graph-building time, are output.
  • v_level=1, the construction information of each nn.Module is additionally printed.
  • v_level=2, during graph building, the creation information of each Op is additionally printed, including its name, input content, device, and SBP information.
  • v_level=3, more detailed information about each Op is printed, such as its code location, which is convenient for locating problems in the code.

More details on this part can be found at https://docs.oneflow.org/master/basics/08_nn_graph.html#graph_3.

5 Summary

The AutoTest framework is quite flexible and easy to use. This article mainly introduced how Graph mode performs operator alignment alongside Eager mode and the Graph-specific aspects of the automated tests. OneFlow v0.7.0 completed the unit-test coverage for executing local ops in Graph mode as well as Eager mode; version 0.8 will ensure the correctness of the Graph global-ops unit tests, and the debugging and other features of static graphs will become more complete. Everyone is welcome to learn about and use them.

6 Related Links

  • https://github.com/Oneflow-Inc/oneflow
  • https://github.com/pytorch/pytorch

Originally published at: blog.csdn.net/weixin_43838785/article/details/124298222