OneFlow source code analysis: static graph and runtime



Author|Jianhua Zheng
Update|Xu Xiaoyu, Zhang Wenxiao, Cheng Cheng


The training efficiency of OneFlow static graphs is much higher than that of dynamic graphs (eager mode). This article uses a simple example, together with the v0.8.0 code, to explain how static graphs and the runtime are implemented.

Before starting, it is recommended to read the series of articles such as " System Design of OneFlow Framework ( https://zhuanlan.zhihu.com/p/337851255 )" in the reference materials. With a basic understanding of static graphs, basic concepts of runtime and design philosophy, it will be easier to understand the code.

Code example

The following sample code comes from the official documentation ( https://docs.oneflow.org/master/basics/08_nn_graph.html ), which is a forward calculation of a linear model. The subsequent analysis is mainly based on this code.

import oneflow as flow
import oneflow.nn as nn


class ModuleMyLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(flow.randn(in_features, out_features))
        self.bias = nn.Parameter(flow.randn(out_features))


    def forward(self, input):
        return flow.matmul(input, self.weight) + self.bias


linear_model = ModuleMyLinear(4, 3)


class GraphMyLinear(nn.Graph):
    def __init__(self):
        super().__init__()
        # ModuleBlock
        self.model = linear_model


    def build(self, input):
        # ModuleBlock.__call__
        return self.model(input)


graph_mylinear = GraphMyLinear()
input = flow.randn(1, 4)
out = graph_mylinear(input)
print(out)

Initialization of the oneflow package

When import oneflow initializes the package ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/__init__.py ), the main operations related to the static graph are as follows:

  • GetEnv ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/__init__.py#L228 )

    • EnvGlobalObjectsScope::Init ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/env_global_objects_scope.cpp#L126 )

      • Start the control-plane network connection of each node ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/env_global_objects_scope.cpp#L160-L162 )

      • Initialize the VM ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/env_global_objects_scope.cpp#L180 )

      • Start the data-plane network connection of each node ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/env_global_objects_scope.cpp#L184-L188 )

      • Initialize KernelObserver ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/env_global_objects_scope.cpp#L192-L203 )

  • NewDefaultSession ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/__init__.py#L229 )

    • RegsiterSession ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/multi_client_session.py#L39 ) creates a Session and registers it as the default session ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/session_util.cpp#L89 )

    • Create the Python MultiClientSession and save it to a dict ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/session_context.py#L40 ), without calling TryInit

      • Create the C++ MultiClientSessionContext ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/multi_client_session.py#L41 ), also without calling TryInit

In EnvGlobalObjectsScope::Init, a global ProcessCtx ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/env_global_objects_scope.cpp#L132 ) object is created first. Then, according to configuration such as environment variables, gRPC and CommNet connections are established between the processes; they are responsible for data transmission on the control plane and the data plane, respectively. The global ProcessCtx is initialized during the Bootstrap process ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/rpc/lib/grpc.cpp#L42 ), and each process is assigned a globally unique rank ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/rpc/lib/global_process_ctx.cpp#L28 ) (machine_id ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/rpc/lib/global_process_ctx.cpp#L24 )).

This article does not involve operations at the network level, but only discusses the interaction between threads in the same process.

Module class

Although a model can be built directly from ops and tensors, ops are too fine-grained, so building a model this way is cumbersome.

Module ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/module.py#L54 ) is a reusable submodule composed of ops and tensors. Using Module makes it faster and easier to build complex models. The oneflow.nn ( https://github.com/Oneflow-Inc/oneflow/blob/d825243aa7aff5cba8bd3a901b4cc56c2b1a36af/python/oneflow/nn/__init__.py ) module exports many predefined Modules.

Module defines its own attribute-setting logic ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/module.py#L262 ). The core logic is:

  • If the value is a Parameter type, save it in Module._parameters

  • If value is of Module type, save it in Module._modules

  • If value is of Tensor type, save it in Module._buffers

  • Otherwise, treat it as a normal attribute

Module can contain sub-modules to form a tree structure. Because the Module saves the sub-Module and Parameter in the dictionary structure through setattr, it is convenient to traverse all Modules and their parameter tensors.
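The attribute routing described above can be illustrated with a minimal pure-Python sketch. It does not use OneFlow; MiniModule, Tensor and Parameter here are hypothetical stand-ins that only mimic the dispatch order (Parameter, then Module, then Tensor, then plain attribute) and show why the resulting dictionaries make traversal easy.

```python
class Tensor:
    """Hypothetical stand-in for a tensor (not oneflow.Tensor)."""
    def __init__(self, name):
        self.name = name


class Parameter(Tensor):
    """Hypothetical stand-in for a trainable tensor."""


class MiniModule:
    def __init__(self):
        # Bypass our own __setattr__ while creating the three stores.
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})
        object.__setattr__(self, "_buffers", {})

    def __setattr__(self, name, value):
        # Parameter must be checked before Tensor because it is a subclass.
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, MiniModule):
            self._modules[name] = value
        elif isinstance(value, Tensor):
            self._buffers[name] = value
        else:
            object.__setattr__(self, name, value)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        for store in ("_parameters", "_modules", "_buffers"):
            entries = self.__dict__.get(store, {})
            if name in entries:
                return entries[name]
        raise AttributeError(name)

    def named_parameters(self, prefix=""):
        # Own parameters first, then recurse into the sub-module tree.
        for name, param in self._parameters.items():
            yield prefix + name, param
        for name, module in self._modules.items():
            yield from module.named_parameters(prefix + name + ".")


child = MiniModule()
child.weight = Parameter("w")

root = MiniModule()
root.layer = child
root.bias = Parameter("b")
root.mean = Tensor("m")
root.tag = "x"

param_names = [name for name, _ in root.named_parameters()]
print(param_names)
```

Because every Parameter and sub-module ends up in a dictionary, collecting all parameters of a nested model is a simple recursive walk.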

Graph class

4.1 Constructor

The session obtained by GetDefaultSession ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L145 ) in the Graph constructor is the one created by NewDefaultSession ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/__init__.py#L229 ) when the oneflow package was imported. It was not initialized at that time; it is initialized when the Graph is constructed ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L147 ). The corresponding C++ function is MultiClientSessionContext::TryInit ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/multi_client_session_context.cpp#L67 ), which creates a number of global resource managers, such as:
 

  • LazyJobBuildAndInferCtxMgr

  • BufferMgr

  • RegstMgr

  • ActorMsgBus

  • ThreadMgr

4.2 __setattr__: Encapsulate Module and Tensor as Block

Graph.__setattr__ supports adding a Module to the Graph by setting an attribute, after which the Module can be called by the Graph. A Module added to the Graph is wrapped in a Block, which acts as a proxy for execution and extends the original eager Module with the special functions required for static execution.

The Module added to the Graph shares its state (Parameter, Buffer) and forward execution logic with the original Module. Sharing the forward logic keeps the computation identical under static and dynamic execution. Sharing the state allows the model state from the dynamic graph to be reused by the static graph. On this basis, two graphs, one for training and one for prediction, can both reuse the same model Module, so the training and prediction graphs share the model as well.

The most important action of __setattr__ is the call to _add_block ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L1332 ). _add_block mainly calls get_block_cls and saves the result ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L1326 ). The role of get_block_cls ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/block.py#L39 ) is to convert the Module and all of its Tensor attributes into the corresponding Block objects. Why is this done? The main reason is that static graph compilation relies on the Block types to implement proxy execution, and these functions are not suitable for writing directly into the eager Module and Tensor.

This conversion is done by calling set_origin ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/block.py#L131 ) when the ModuleBlock is constructed. For sub-Modules, get_block_cls ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/block.py#L145 ) is called recursively, so that every sub-Module and its Tensor attributes are converted into the corresponding Block objects.

Therefore, in the above sample code, GraphMyLinear actually stores ModuleBlock, the model attribute obtained when Graph.build is executed is also a ModuleBlock object, and ModuleBlock.origin is ModuleMyLinear.
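The recursive wrapping can be pictured with a small pure-Python sketch. It is not OneFlow code: Module, ModuleBlock, TensorBlock and the helper wrap are simplified, hypothetical stand-ins. The point it illustrates is that a Block keeps a reference to its origin rather than a copy, which is what makes state sharing possible.

```python
class Module:
    """Hypothetical eager module holding parameters and child modules."""
    def __init__(self, params=None, children=None):
        self.params = dict(params or {})
        self.children = dict(children or {})


class TensorBlock:
    """Proxy for one tensor; keeps a reference to the original object."""
    def __init__(self, origin):
        self.origin = origin


class ModuleBlock:
    """Proxy for one module; wraps its state and children recursively."""
    def __init__(self, origin):
        self.set_origin(origin)

    def set_origin(self, origin):
        self.origin = origin
        # State is wrapped, not copied, so it stays shared with the origin.
        self.params = {k: TensorBlock(v) for k, v in origin.params.items()}
        self.children = {k: wrap(v) for k, v in origin.children.items()}


def wrap(obj):
    """Hypothetical stand-in for the recursive Module-to-Block conversion."""
    return ModuleBlock(obj) if isinstance(obj, Module) else TensorBlock(obj)


weight = [1.0, 2.0]
inner = Module(params={"weight": weight})
outer = Module(children={"inner": inner})
block = wrap(outer)

print(type(block.children["inner"]).__name__)
print(block.children["inner"].params["weight"].origin is weight)
```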

Graph.__setattr__ does not allow Tensor objects to be set as attributes ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L1340 ). A Tensor can only be stored in a Module, because Module is the basic unit of state sharing, whereas a Graph is not designed to be reused.

4.3 Define different calculation graphs for different tasks

Following the examples in the OneFlow Model Zoo ( https://github.com/Oneflow-Inc/models/blob/1b291f78d8f60e5f04ee0c5962e4611cc4bab40a/Vision/classification/image/alexnet/graph/train.py ), stages such as train and eval can each define a different Graph subclass. Components such as Module, Optimizer, and Dataloader are provided in dynamic graph mode, and these components can be added to a Graph. Different combinations build different types of tasks.

In these different stages, the behavior of the Graph constructor and the input and output of the build function have their own characteristics. Knowing these, it will be easier to understand the specific meaning of each parameter when looking at the subsequent code.

  • Constructor

    • In the train phase, you need to add Module, loss function, optimizer and dataloader

    • In the eval stage, only need to add Module and dataloader

  • build function

    • train

      • Import samples and labels

      • Call the Module to get the forward calculation result

      • calculate loss

      • Calculate the gradient

      • return loss

    • eval

      • Import samples and labels

      • Call the Module to get the estimated result

      • Return the estimated result and label

4.4 Summary

The relationships between the above types are as follows:

a714129bd3b8cbfd483f412b01a5223b.png

The construction process of GraphMyLinear is as follows:

* `__init__`
  * `Graph.__init__`
  * self.model = linear_model
    * `Graph.__setattr__`
      * _add_block
        * get_block_cls: recursively converts Module into ModuleBlock
        * `ModuleBlock.__init__`
          * ModuleBlock.set_origin
            * `ModuleBlock._origin = origin` (Module)
            * recursively call get_block_cls on origin's sub modules, parameters, and buffers
            * `ModuleBlock.__setattr__`

Compilation of logical graphs

Compiling a programming language translates high-level statements into assembly or machine instructions. When a deep learning framework compiles a computing task, it converts the user's operations into a DAG. In OneFlow, Job ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/job.proto#L30 ) is used to describe the logical computation graph.

Unlike the dynamic graph in eager mode, a static graph has all the information about the entire computing task before execution starts, so the DAG can go through multiple rounds of optimization. Each round of optimization takes a Job as input and produces a new Job.
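The "Job in, new Job out" shape of an optimization round can be sketched in plain Python. The two passes below (bypassing identity ops and pruning unreachable ops) and the dict-based job layout are illustrative assumptions, not OneFlow's actual passes or its Job protobuf.

```python
from functools import reduce


def remove_identity(job):
    """Return a new job in which 'identity' ops are bypassed."""
    alias = {name: op["inputs"][0]
             for name, op in job.items() if op["type"] == "identity"}

    def resolve(name):
        # Follow chains of identity ops back to the real producer.
        while name in alias:
            name = alias[name]
        return name

    return {name: {"type": op["type"],
                   "inputs": [resolve(i) for i in op["inputs"]]}
            for name, op in job.items() if op["type"] != "identity"}


def prune_unreachable(job, outputs=("out",)):
    """Return a new job that keeps only ops reachable from the outputs."""
    keep, stack = set(), list(outputs)
    while stack:
        name = stack.pop()
        if name not in keep:
            keep.add(name)
            stack.extend(job[name]["inputs"])
    return {name: op for name, op in job.items() if name in keep}


def run_passes(job, passes):
    # Each pass consumes a job and produces a new one; the input is untouched.
    return reduce(lambda current, p: p(current), passes, job)


job = {
    "x": {"type": "input", "inputs": []},
    "id": {"type": "identity", "inputs": ["x"]},
    "w": {"type": "variable", "inputs": []},
    "mm": {"type": "matmul", "inputs": ["id", "w"]},
    "unused": {"type": "relu", "inputs": ["x"]},
    "out": {"type": "output", "inputs": ["mm"]},
}
optimized = run_passes(job, [remove_identity, prune_unreachable])
print(sorted(optimized))
```

Writing each pass as a pure function keeps the rounds independent, so passes can be reordered or added without touching the others.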

Finally, according to the distributed environment configuration, the logical graph Job is converted into the physically executed computation graph Plan ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/plan.proto#L34 ). In a physical graph, an op may be distributed across multiple nodes/processes.

To start the DAG calculation, you need to call Graph.__call__. The execution of this function is mainly divided into the following steps:

  • __call__

    • _compile ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L221 ) if not _is_compiled

      • build_graph ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L741 )

        • __build_graph ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L759 )

      • finish_complie_and_init_runtime ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L742 )

    • __run ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L226 )

Logical graph compilation is mainly performed in __build_graph. finish_complie_and_init_runtime runs some further optimization passes, then builds the physical graph and initializes the runtime Actor system. __run starts one execution of the DAG.
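The compile-once, run-many control flow of Graph.__call__ can be summarized with a toy class. MiniGraph is a hypothetical stand-in; its _compile and _run only record which step ran and do not build a real plan or runtime.

```python
class MiniGraph:
    """Toy model of the call flow: compile on first call, then only run."""

    def __init__(self):
        self._is_compiled = False
        self.log = []

    def build(self, x):
        return x * 2

    def _compile(self, *args):
        # Stand-ins for the logical graph build and the plan/runtime setup.
        self.log.append("build_graph")
        self.log.append("finish_compile_and_init_runtime")
        self._compiled_fn = self.build  # pretend this is the compiled plan
        self._is_compiled = True

    def _run(self, *args):
        self.log.append("run")
        return self._compiled_fn(*args)

    def __call__(self, *args):
        if not self._is_compiled:
            self._compile(*args)
        return self._run(*args)


graph = MiniGraph()
first = graph(3)
second = graph(4)
print(graph.log)
```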

5.1 graph_build_context: Set up the basic environment for logical graph compilation

In Graph, the code in the build function executes under the scope of graph_build_context, which is how dynamic code is converted to a static graph.

graph_build_context in __build_graph ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L851 ) is only one line of code, but it does a couple of very important things.

First set the global lazy_mode to True in the context scope ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L46 ). Within this context scope, all ops are interpreted and executed by LazyInterpreter.
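A scoped global switch of this kind is commonly written as a context manager. The sketch below is a generic illustration, not OneFlow's implementation: lazy_mode_guard, the module-level flag and the add function are hypothetical. Inside the scope an op is recorded symbolically instead of being computed, and the previous mode is restored on exit.

```python
from contextlib import contextmanager

_lazy_mode = False   # hypothetical global flag
recorded_ops = []    # ops captured while in lazy mode


@contextmanager
def lazy_mode_guard(enabled):
    """Set the global mode for the duration of the with-block."""
    global _lazy_mode
    previous = _lazy_mode
    _lazy_mode = enabled
    try:
        yield
    finally:
        _lazy_mode = previous  # always restore, even on exceptions


def add(a, b):
    # Dispatch on the current mode: record the op, or compute right away.
    if _lazy_mode:
        recorded_ops.append(("add", a, b))
        return f"lazy({a}+{b})"
    return a + b


eager_result = add(1, 2)
with lazy_mode_guard(True):
    lazy_result = add("x", "y")
after_result = add(3, 4)
print(eager_result, lazy_result, after_result)
```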

Second, within the scope of JobBuildAndInferCtx ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L47 ), JobBuildAndInferCtx_Open ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L57 ) calls C++ code similar to the following:

// oneflow/api/python/job_build/job_build_and_infer.h
// oneflow/core/job/job_build_and_infer_ctx_mgr.cpp


// As mentioned above, LazyJobBuildAndInferCtxMgr is initialized when MultiClientSessionContext::TryInit executes.
// LazyJobBuildAndInferCtxMgr mgr;
mgr.OpenJobBuildAndInferCtx(job_name);

OpenJobBuildAndInferCtx creates a new Job object ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/job_build_and_infer_ctx_mgr.cpp#L32 ) and a LazyJobBuildAndInferCtx object ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/job_build_and_infer_ctx_mgr.cpp#L34 ). LazyJobBuildAndInferCtx is responsible for modifying the Job according to the ops and other operations defined by the user; its most important function is adding new ops.

5.2 __build_io: Add input and output Op to the calculation graph

self.__build_io("input", graph_build_util.build_graph_input_arg, *args, **kwargs)

The above line of code ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L854-L856 ) handles the input arguments that the user passes to graph_mylinear(input): for each tensor among them, a FeedInputOp ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/system_ops.h#L48 ) node is inserted into the logical computation graph. In other words, the input of the model (such as the sample tensor, see Section 4.3) is also treated as an op in the static graph.

__build_io uses args (i.e. input) and kwargs to construct an ArgsTree. ArgsTree abstracts Python inputs and outputs into a tree. The inputs and outputs can be nested Tuple, List, and Dict structures whose elements are Tensors. Such a nested structure maps naturally onto a tree, with the Tensors as leaf nodes. In the sample code, kwargs is empty.

__build_io traverses the ArgsTree and calls the build_func passed in on each tensor in args and kwargs; for input, this is build_graph_input_arg ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L206 ). As shown later, the model output also goes through __build_io, so the name of this function presumably means building the static-graph ops for the model's inputs and outputs.
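The traversal idea, applying a build function to every tensor leaf of a nested tuple/list/dict while preserving the structure, can be sketched without OneFlow. map_leaves, FakeTensor and build_input below are hypothetical names; the real ArgsTree additionally wraps values in NamedArg objects.

```python
def map_leaves(node, build_func, is_leaf):
    """Rebuild a nested tuple/list/dict, replacing each leaf via build_func."""
    if is_leaf(node):
        return build_func(node)
    if isinstance(node, tuple):
        return tuple(map_leaves(n, build_func, is_leaf) for n in node)
    if isinstance(node, list):
        return [map_leaves(n, build_func, is_leaf) for n in node]
    if isinstance(node, dict):
        return {k: map_leaves(v, build_func, is_leaf) for k, v in node.items()}
    return node  # non-tensor values pass through unchanged


class FakeTensor:
    """Hypothetical stand-in for an eager tensor."""
    def __init__(self, name):
        self.name = name


inserted_ops = []


def build_input(tensor):
    # Stand-in for inserting an input op and returning a lazy tensor.
    inserted_ops.append("FeedInput:" + tensor.name)
    return "lazy_" + tensor.name


args = (FakeTensor("a"), [FakeTensor("b"), {"k": FakeTensor("c")}], 5)
kwargs = {}
lazy_args, lazy_kwargs = map_leaves(
    (args, kwargs), build_input, lambda x: isinstance(x, FakeTensor)
)
print(lazy_args, lazy_kwargs)
```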

build_graph_input_arg constructs a FeedInputOpExpr ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L213 ) and submits it to the interpreter for execution. Because this happens in the lazy scope, it is interpreted by LazyInterpreter ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L471 ), which inserts the corresponding op into the static graph.

Attachment: The internal structure of ArgsTree when building input

0a6b6c85614174cdc63f23cb4422c0ce.png

Internal data organization of ArgsTree in __build_io(input)

  • _named_io_args: NamedArg

    • _value: tuple

      • [0]: NamedArg

        • _value: tuple of NamedArg

          • [0]: NamedArg

            • _value: args tensor from Graph.__call__

      • [1]: NamedArg

        • _value: empty kwargs from Graph.__call__

Variables can be viewed through the pdb command: p args_tree._named_io_args._value[0]._value[0]._value.to_numpy()

5.2.1 Add op to the logical graph

When LazyInterpreter::ApplyImpl ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L471 ) executes, GetCurInferCtx() ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L500 ) returns the LazyJobBuildAndInferCtx object created by OpenJobBuildAndInferCtx in graph_build_context ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L57 ). This object is responsible for constructing the logical graph. The main call sequence for adding an op is as follows:

  • infer_ctx->AddAndInferConsistentOp ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L503 )

  • AddAndInferOp ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/job_build_and_infer_ctx.cpp#L563 )

  • ConstructOp ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/job_build_and_infer_ctx.cpp#L580 )

  • CheckAndConstructOp ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/operator/operator.cpp#L1216 )

  • NewObj ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/operator/operator.cpp#L51 )


In OperatorConf, the configurations of the various ops share the op_type field ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/operator/op_conf.proto#L412 ), which is a protobuf oneof. The op_type_case constant of this oneof is used as the key under which NewObj is registered.
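The registration scheme, a factory keyed by which member of the oneof is set, is a classic registry pattern. The Python sketch below only mirrors the idea; the decorator register_op, the function new_obj, and the string keys used as "cases" are illustrative assumptions rather than OneFlow's C++ macros or protobuf constants.

```python
_OP_REGISTRY = {}  # maps an op-type case to the class that implements it


def register_op(op_type_case):
    """Class decorator that registers an Operator subclass under a key."""
    def decorator(cls):
        _OP_REGISTRY[op_type_case] = cls
        return cls
    return decorator


class Operator:
    def __init__(self, conf):
        self.conf = conf


@register_op("user_conf")
class UserOp(Operator):
    pass


@register_op("feed_input_conf")
class FeedInputOp(Operator):
    pass


def new_obj(op_conf):
    """Create the operator whose key matches the single type field set."""
    # A oneof guarantees exactly one type field besides the common ones.
    (case,) = [key for key in op_conf if key != "name"]
    return _OP_REGISTRY[case](op_conf)


op = new_obj({"name": "matmul_0", "user_conf": {"op_type_name": "matmul"}})
print(type(op).__name__)
```

With this layout, adding a new op type only requires registering one more class; the construction code does not change.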

The ops predefined by the system are under oneflow/core/operator ( https://github.com/Oneflow-Inc/oneflow/tree/release/v0.8.0/oneflow/core/operator ), such as UserOp ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/operator/user_op.h#L24 ).

AddAndInferOp saves the returned Operator into a dictionary in LazyJobBuildAndInferCtx. The subsequent function calls mainly perform inference on and modify the static graph Job, so that the nodes form a DAG.

The class relationship related to JobBuildAndInferCtx is as follows:

0820ba8658ba6b061b7efe19c14ef22f.png

5.2.2 The difference between lazy tensor and eager tensor

At the end of LazyInterpreter::ApplyImpl, BuildTensor ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L518 ) is called to construct a lazy tensor, which becomes the return value of build_graph_input_arg ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L216 ). So the lazy_args ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L854 ) returned by __build_io are lazy tensors. They replace the eager args ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L828 ) (that is, the input passed in by the user) in the subsequent construction of the computation graph.

So what is the difference between a lazy tensor and an eager tensor? An eager tensor is computed immediately, so it needs real data. A lazy tensor is only used for inference during static graph compilation, so it only needs meta information describing its properties. Static graph compilation runs in lazy mode and uses lazy tensors only for graph construction and validation.
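To make "meta information only" concrete, the sketch below propagates shape and dtype through the matmul and add of the sample model without touching any data. LazyTensorMeta, infer_matmul and infer_add are hypothetical helpers, and the broadcast rule in infer_add is deliberately simplified to trailing-dimension matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyTensorMeta:
    """Hypothetical lazy tensor: metadata only, no data buffer."""
    shape: tuple
    dtype: str


def infer_matmul(a, b):
    # Validate and infer the output purely from metadata.
    if a.shape[-1] != b.shape[0]:
        raise ValueError(f"matmul shape mismatch: {a.shape} x {b.shape}")
    if a.dtype != b.dtype:
        raise ValueError("matmul dtype mismatch")
    return LazyTensorMeta(a.shape[:-1] + b.shape[1:], a.dtype)


def infer_add(a, b):
    # Simplified rule: b must match the trailing dimensions of a.
    if a.shape[-len(b.shape):] != b.shape:
        raise ValueError(f"add shape mismatch: {a.shape} + {b.shape}")
    return LazyTensorMeta(a.shape, a.dtype)


x = LazyTensorMeta((1, 4), "float32")
weight = LazyTensorMeta((4, 3), "float32")
bias = LazyTensorMeta((3,), "float32")

out = infer_add(infer_matmul(x, weight), bias)
print(out)

try:
    infer_matmul(x, LazyTensorMeta((5, 3), "float32"))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Shape errors are therefore reported while the graph is being built, before any computation runs.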

As shown later, there is no concept of a tensor at static graph runtime. What the runtime sees is only the more general Regst storage, which may hold a tensor/blob or other control information. At runtime, the input of the static graph is read directly from the memory of the external eager tensor into a regst; the output is presumably written by an op into a regst, and an eager tensor is then constructed from the blob.

5.3 build: Add UserOp and FeedVariableOp to the logical graph

self.build() in __build_graph ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L861 ) calls GraphMyLinear.build() and, through it, ModuleMyLinear.forward(). Because this runs in lazy mode, both matmul and add call the UserOpExpr overload of LazyInterpreter::ApplyImpl ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L832 ), which then calls AddAndInferConsistentOp ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L940 ) to add the op to the graph.

Note that referencing a Parameter attribute of the Module (such as weight/bias) triggers the graph construction of a FeedVariableOp ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L226 ), which calls the corresponding overload of LazyInterpreter::ApplyImpl ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L527 ). How is this implemented?

In __build_graph, before entering lazy mode, _create_states_builder is called ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L843 ). There, self._state() ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L667 ) returns all Parameters of all Modules (including sub-Modules).

The type of state_block is TensorBlock ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/block.py#L631 ). The lazy_origin_builder().method ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/block.py#L647 ) of every state_block is set to call build_graph_state ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L683-L688 ).

Setting a breakpoint in build_graph_state ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L220 ) makes the entire calling process visible. The main call stack is as follows:

-> out = graph_mylinear(input)
  /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/graph.py(221)__call__()
-> self._compile(*args, **kwargs)
  /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/graph.py(741)_compile()
-> _, eager_outputs = self.build_graph(*args, **kwargs)
  /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/graph.py(759)build_graph()
-> outputs = self.__build_graph(*args, **kwargs)
  /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/graph.py(864)__build_graph()
-> outputs = self.build(*lazy_args, **lazy_kwargs)
  /mnt/project/machine-learning/oneflow/oneflow/test.py(21)build()
-> return self.model(input)
  /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/block.py(234)__call__()
-> result = self.__block_forward(*args, **kwargs)
  /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/block.py(266)__block_forward()
-> result = self._origin.__class__.forward(self, *args, **kwargs)
  /mnt/project/machine-learning/oneflow/oneflow/test.py(11)forward()
-> return flow.matmul(input, self.weight) + self.bias
  /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/block.py(483)__getattr__()
-> p_state = self._get_from_states(name, "_parameters")
  /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/block.py(521)_get_from_states()
-> _s_block.try_build()
  /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/block.py(679)try_build()
-> self._lazy_origin_builder.try_build(self)
  /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/block.py(627)try_build()
-> self.result = self.method()
> /usr/local/lib64/python3.6/site-packages/oneflow/framework/graph_build_util.py(227)build_graph_state()
-> op_name, var_conf_str, ["in_0"], ["out_0"]

What makes this call chain hard to follow is that the executing object switches between Graph, GraphMyLinear, ModuleMyLinear, and ModuleBlock.

As mentioned earlier when discussing the construction of Graph, when executing self.model(input), the attribute model returned by Graph.__getattr__ is a ModuleBlock object, so the actual call is ModuleBlock.__call__.

Inside this function, __block_forward ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/block.py#L234 ) is called, where _origin ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/block.py#L266 ) is ModuleMyLinear. Execution enters its forward method and evaluates flow.matmul(input, self.weight) + self.bias. matmul is executed by LazyOpInterpreter, which calls AddAndInferConsistentOp ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L503 ) to add a matmul operator to the Job. Similarly, the subsequent addition adds an add operator to the Job.

Accessing self.weight and self.bias triggers ModuleBlock.__getattr__, which in turn calls _get_from_states ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/block.py#L483 ), which calls TensorBlock.try_build() ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/block.py#L521 ). What is executed here is the build_graph_state that was set before entering lazy mode ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L220 ). This adds a FeedVariableOp to the computation graph ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L527 ). Why are the setup and the call so far apart? The main purpose is to place each parameter, as far as possible, in the same scope as the Operator that consumes it, so it is implemented with lazy evaluation to defer the construction.
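The "set the builder early, run it on first access" mechanism can be reduced to a few lines of Python. LazyBuilder and MiniModuleBlock below are simplified, hypothetical stand-ins for the builder objects and ModuleBlock; the strings appended to graph_ops merely stand for ops being added to the graph.

```python
class LazyBuilder:
    """Holds a build method and runs it at most once, on demand."""
    def __init__(self):
        self.method = None
        self.result = None
        self.finished = False

    def try_build(self):
        if not self.finished:
            self.result = self.method()
            self.finished = True
        return self.result


class MiniModuleBlock:
    def __init__(self, param_names):
        self.graph_ops = []
        self._builders = {}
        for name in param_names:
            builder = LazyBuilder()
            # The build method is set up front but nothing is built yet.
            builder.method = lambda name=name: self._add_variable_op(name)
            self._builders[name] = builder

    def _add_variable_op(self, name):
        self.graph_ops.append("FeedVariable:" + name)
        return "lazy_" + name

    def __getattr__(self, name):
        # Only reached for names that are not regular attributes.
        builders = self.__dict__.get("_builders", {})
        if name in builders:
            return builders[name].try_build()
        raise AttributeError(name)


block = MiniModuleBlock(["weight", "bias"])
ops_before_access = list(block.graph_ops)
first = block.weight   # first access triggers the build
second = block.weight  # later accesses reuse the result
print(ops_before_access, block.graph_ops)
```

In this sketch the variable op for bias is never created because bias is never read, which shows how deferring the build ties each parameter to the place where it is actually consumed.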

The next step is to call __build_io ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L869-L875 ) to insert FetchOutputOp ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L589 ). In other words, getting the output of the model is also an op.

So far, the forward computation graph has been constructed. Its json representation can refer to the appendix. net.op is the node of the calculation graph, and the connection relationship between nodes can be seen through attributes such as input.

The forward computation graph of the sample code is as follows. As can be seen from this figure, input, output, weights, etc. are all op.

daa1b8f1e66b6e63587631b427c33760.png

5.4 Logical graph optimization

In __build_graph, CurJobBuildAndInferCtx_Complete will be called to perform multiple rounds of optimization on static graphs ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L923 ), the corresponding C++ function is LazyJobBuildAndInferCtx::Complete() ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/job_build_and_infer_ctx.cpp#L975 ).

The Job generated after this is full_job. The sample code in this article is simple and not a typical computing scenario, so the forward and full computation graphs have the same topology. Most of the actual graph optimizations are implemented at this stage, such as op fusion, AMP, ZeRO, and constant folding.

At this point, the main part of logical graph construction is over.

Then a CNNGraph object will be built ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L947 ), and the corresponding C++ type is NNGraph ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/nn_graph.h#L33 ). This object will be responsible for building the physical calculation graph Plan. It is also the owner and maintainer of the entire runtime. When this object is destructed, the entire runtime will also terminate in an orderly manner and release resources.

5.5 Compilation of physical graphs

The next step is to execute finish_complie_and_init_runtime ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L742 ), whose core call is self._c_nn_graph.complie_and_init_runtime() ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L802 ); the corresponding C++ function is NNGraph::CompileAndInitRuntime ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/nn_graph.cpp#L265 ).

In this function, JobCompleter().Complete() ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/nn_graph.cpp#L280 ) applies several more rounds of modification and optimization to the logical graph, adding the extra information required for runtime execution. Compiler().Compile() ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/nn_graph.cpp#L285 ) converts the logical graph into a per-device physical graph and continues to modify and optimize the Plan.

The Plan is compiled on the master node ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/nn_graph.cpp#L282 ). The master node pushes the Plan to each worker node through gRPC ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/nn_graph.cpp#L308 ), and the worker nodes pull the physical computation graph from the master ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/nn_graph.cpp#L310 ).

NewRuntimeBuffers is then called to create Buffer objects ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/nn_graph.cpp#L322 ); Buffer is presumably used mainly for information synchronization within the process.

After that, the runtime is ready to be initialized.

Please refer to the appendix for the compiled_job generated by the sample code and the JSON of the physical graph Plan.

The final compiled logical graph is as follows. The framework automatically inserts many system control nodes.

8f0b572fdacaa177d5038eaadc5d2faf.png


5.6 Structure of Plan

See the appendix for the Plan json data output by the sample code.

Plan is logically equivalent to compiled_job. Here we mainly focus on the relationship between task/op.

Each element in Plan.task is a task, where exec_sequence.exec_node corresponds to the op in the job, usually there is only one op (the array can support sub graph).

exec_node.kernel_conf.op_attribute describes op information. Among them, op_conf contains op name information.

kernel_conf.op_attribute.op_conf is the OperatorConf in Job.

kernel_conf.op_attribute.arg_signature.bn_in_op2lbi reflects the connection relationship between task/op.

bn_in_op stands for "blob name in op", that is, the name by which the op refers to one of its own input or output blobs.

Take System-AutoTick-DstSubsetTick_21 as an example:

{
  "out": {
    "op_name": "System-AutoTick-DstSubsetTick_21",
    "blob_name": "out"
  },
  "in_0": {
    "op_name": "System-EagerCriticalSection-Interface-End-Tick-19",
    "blob_name": "out"
  },
  "in_1": {
    "op_name": "System-AutoTick-SrcSubsetTick_20",
    "blob_name": "out"
  }
}

exec_node.bn_in_op2regst_desc_id reflects the connection relationship at the task level. The key in this map represents the input and output, and the value is the register id.

{
"out": "29",
"in_0": "27",
"in_1": "28"
}

task.produced_regst_desc describes the registers produced by the task; consumer_task_id identifies the consumers.

produced_regst_desc.out.regst_desc_type.data_regst_desc.lbi2blob_desc.lbi is the logical blob id of this register.

task.consumed_regst_desc_id describes the registers consumed by the task.
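As a rough illustration of how these fields relate to each other, the following sketch walks a hand-written dictionary shaped like the JSON fragments above. The dictionary layout, the helper names upstream_ops and input_regsts, and the rule that input blob names start with "in" are assumptions made for this example; the real Plan is a protobuf message.

```python
# A hand-written fragment that mimics the Plan JSON shown above (not real output).
exec_node = {
    "bn_in_op2lbi": {
        "out": {"op_name": "System-AutoTick-DstSubsetTick_21", "blob_name": "out"},
        "in_0": {"op_name": "System-EagerCriticalSection-Interface-End-Tick-19", "blob_name": "out"},
        "in_1": {"op_name": "System-AutoTick-SrcSubsetTick_20", "blob_name": "out"},
    },
    "bn_in_op2regst_desc_id": {"out": "29", "in_0": "27", "in_1": "28"},
}


def upstream_ops(node):
    """Return the names of the ops that produce this op's input blobs."""
    # Assumption: input blob names start with "in".
    pairs = sorted(node["bn_in_op2lbi"].items())
    return [lbi["op_name"] for bn, lbi in pairs if bn.startswith("in")]


def input_regsts(node):
    """Map each input blob name to the id of the register that carries it."""
    regsts = node["bn_in_op2regst_desc_id"]
    return {bn: int(rid) for bn, rid in regsts.items() if bn.startswith("in")}


print(upstream_ops(exec_node))
print(input_regsts(exec_node))
```

The op-level map answers "which op produced this blob", while the register-level map answers "which register carries it between tasks".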

runtime initialization

In NNGraph::CompileAndInitRuntime, the new Runtime line of code will initialize the runtime ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/nn_graph.cpp#L331 ) . The main things to do include:

  • Create Thread

  • Notify Thread to create Actor, Actor will create Regst and Kernel

  • Send the start signal kStart to source_tasks without input

6.1 Runtime creates Thread

In the constructor of Runtime, DumpThreadIdsFromPlan ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/runtime.cpp#L65 ) stores the thread ids of the tasks in the Plan that belong to the current process in the thread_ids_ variable. AddThreads then creates the corresponding Thread objects ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/runtime.cpp#L69 ).

A Thread creates a physical thread when it is constructed ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/thread/thread.cpp#L39 ), and that thread executes the PollMsgChannel method ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/thread/thread.cpp#L44 ), in which the Thread keeps waiting for new messages to process.

Thread only handles two types of command messages: thread termination messages and Actor creation messages. Other messages are handed over to Actor::ProcessMsg ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/thread/thread.cpp#L83 ).
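The polling loop described above can be sketched in plain Python. This is only a toy model: SketchThread, EchoActor and the tuple message format are hypothetical names invented for the example, and a queue.Queue stands in for the thread's message channel.

```python
import queue
import threading

STOP_THREAD = "kStopThread"
CONSTRUCT_ACTOR = "kConstructActor"


class EchoActor:
    """Hypothetical actor that just records every message it processes."""

    def __init__(self, actor_id):
        self.actor_id = actor_id
        self.seen = []

    def process_msg(self, payload):
        self.seen.append(payload)


class SketchThread:
    def __init__(self):
        self.channel = queue.Queue()   # messages posted by other threads
        self.actors = {}               # actor_id -> actor object
        self.worker = threading.Thread(target=self.poll_msg_channel)
        self.worker.start()            # the OS thread starts polling at once

    def poll_msg_channel(self):
        while True:
            kind, actor_id, payload = self.channel.get()  # blocks until a message arrives
            if kind == STOP_THREAD:
                break
            if kind == CONSTRUCT_ACTOR:
                self.actors[actor_id] = EchoActor(actor_id)
                continue
            # Every other message is handed to the destination actor.
            self.actors[actor_id].process_msg(payload)


thread = SketchThread()
thread.channel.put((CONSTRUCT_ACTOR, 7, None))
thread.channel.put(("kRegstMsg", 7, "regst-29"))
thread.channel.put((STOP_THREAD, None, None))
thread.worker.join()
print(thread.actors[7].seen)
```

The thread itself understands only the two command messages; everything else is opaque to it and is dispatched by actor id.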

6.2 Runtime notifies Thread to create Actor

In the Runtime constructor, tasks are divided into two categories: source_tasks and other_tasks. In the example code, source_tasks ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/runtime.cpp#L84-L85 ) is a task with no input edges.

Judging from the code logic, the consumed_regst_desc_id field of a task in the Plan proto is a map. If every key of this map is in_ctrl ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/runtime.cpp#L54 ), the task is a source task.
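A minimal sketch of this classification rule, using plain dictionaries in place of the task protos. The helper names and the contents of the consumed_regst_desc_id maps below are made up for illustration; only the rule itself (a task is a source task when it consumes nothing but in_ctrl) comes from the description above.

```python
def has_non_ctrl_consumed_regst(task):
    """True if the task consumes at least one register other than in_ctrl."""
    return any(name != "in_ctrl" for name in task["consumed_regst_desc_id"])


def split_tasks(tasks):
    """Partition tasks into (source_tasks, other_tasks)."""
    source_tasks, other_tasks = [], []
    for task in tasks:
        if has_non_ctrl_consumed_regst(task):
            other_tasks.append(task)
        else:
            source_tasks.append(task)
    return source_tasks, other_tasks


# Hand-written tasks; the consumed register maps are invented for this example.
tasks = [
    {"name": "System-Src-WaitAndSendIds_16", "consumed_regst_desc_id": {}},
    {"name": "System-AutoTick-AppendDeviceTick_9", "consumed_regst_desc_id": {"in_ctrl": [3]}},
    {"name": "System-AutoTick-DstSubsetTick_21", "consumed_regst_desc_id": {"in": [27, 28]}},
]
sources, others = split_tasks(tasks)
print([t["name"] for t in sources])
```

Note that a task with an empty map also counts as a source task, since "all keys are in_ctrl" holds trivially.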

Some examples of source_tasks are as follows:

  • System-Src-WaitAndSendIds_16

  • System-AutoTick-AppendDeviceTick_9

  • System-EagerCriticalSection-Interface-End-Tick-19

  • System-EagerCriticalSection-Interface-End-Tick-25

Runtime calls the HandoutTasks function ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/runtime.cpp#L100-L101 ), which sends kConstructActor messages to ActorMsgBus so that the Actors are constructed ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/runtime.cpp#L49 ).

6.3 Message processing of ActorMsgBus and Thread

From the interface point of view, ActorMsgBus ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor_message_bus.cpp#L24 ) is responsible for sending messages (Actors send messages through ActorMsgBus), and Thread::PollMsgChannel ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.1/oneflow/core/thread/thread.cpp#L60 ) is responsible for receiving and processing them.

The collaborative relationship of related entities is as follows

  • Actor is the basic unit of self-scheduling. It receives messages and then works, and then continues to send messages after finishing the work.

    • actor_id is task_id, which is determined when compiling the Plan. Task is a compile-time concept, and actor is an equivalent runtime concept.

    • task_id has a specific encoding format ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/graph/task_id.cpp#L21-L29 ), from which machine_id ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/graph/task_id.cpp#L73 ) and thread_id ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/graph/task_id.cpp#L77 ) can be parsed.

    • Across the whole physical graph Plan, which may span the network, the actor id serves as an address through which a unique actor entity can be located.

  • Actor sends ActorMsg ( https://github.com/Oneflow-Inc/oneflow/blob/4856d691051accd72f13f4139d281e411977b297/oneflow/core/lazy/actor/actor_message.h#L34 ) messages via ActorMsgBus::SendMsg ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor_message_bus.cpp#L24 ).

    • ActorMsg contains source and destination actor ids ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor_message.h#L84-L85 ).

    • For in-process communication ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor_message_bus.cpp#L26 ), ActorMsgBus::SendMsgWithoutCommNet ( https://github.com/Oneflow-Inc/oneflow/blob/4856d691051accd72f13f4139d281e411977b297/oneflow/core/lazy/actor/actor_message_bus.cpp#L49 ) sends the ActorMsg to the thread on which the destination actor runs ( https://github.com/Oneflow-Inc/oneflow/blob/4856d691051accd72f13f4139d281e411977b297/oneflow/core/thread/thread.h#L40 ).

    • Thread::EnqueueActorMsg checks whether the current thread is the actor's own thread. If so, the message goes into the local queue; otherwise it goes into the channel queue of the actor thread.

    • If ActorMsg is a cross-process message, ActorMsgBus sends messages through CommNet ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor_message_bus.cpp#L42-L44 ), the receiver's CommNet should obtain the thread id according to the actor id, find the Thread from ThreadMgr, and hand over the message to the Thread for processing.

  • Thread::PollMsgChannel ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.1/oneflow/core/thread/thread.cpp#L60 ) is responsible for receiving and processing messages.

    • If the thread-local queue local_msg_queue_ is empty, take all ActorMsg from the thread's channel queue and put them into the local queue ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.1/oneflow/core/thread/thread.cpp#L63 ).

    • Take an ActorMsg from the local queue and start processing.

    • Handle the special kCmdMsg messages ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.1/oneflow/core/thread/thread.cpp#L67-L79 ), and hand ordinary messages over to the Actor to process itself ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.1/oneflow/core/thread/thread.cpp#L83 ).

  • After an Actor receives a message, it checks whether the conditions for Act are satisfied. If they are, Act is executed, which calls LaunchKernel to perform the computation. After Act finishes, the Actor notifies its upstream and downstream Actors by sending messages through ActorMsgBus.
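The routing rules in the list above can be condensed into a small sketch. Everything here is a simplification under stated assumptions: the bit layout of the actor id is invented for the example (the real field widths differ), SketchThread and SketchMsgBus are hypothetical classes, a plain list stands in for CommNet, and the sender's thread id is used as a proxy for "the current thread".

```python
from collections import deque

# Hypothetical id layout for illustration only; the real bit widths differ.
THREAD_BITS = 12
INDEX_BITS = 20


def make_actor_id(machine_id, thread_id, index):
    return (machine_id << (THREAD_BITS + INDEX_BITS)) | (thread_id << INDEX_BITS) | index


def machine_id_of(actor_id):
    return actor_id >> (THREAD_BITS + INDEX_BITS)


def thread_id_of(actor_id):
    return (actor_id >> INDEX_BITS) & ((1 << THREAD_BITS) - 1)


class SketchThread:
    def __init__(self, thread_id):
        self.thread_id = thread_id
        self.local_msg_queue = deque()  # touched only by the owning thread
        self.channel = deque()          # stands in for the thread-safe channel

    def enqueue_actor_msg(self, msg, sender_thread_id):
        # Messages from an actor on the same thread skip the channel.
        if sender_thread_id == self.thread_id:
            self.local_msg_queue.append(msg)
        else:
            self.channel.append(msg)


class SketchMsgBus:
    def __init__(self, this_machine_id, threads):
        self.this_machine_id = this_machine_id
        self.threads = threads          # thread_id -> SketchThread
        self.sent_over_network = []     # stands in for CommNet

    def send_msg(self, src_actor_id, dst_actor_id, payload):
        msg = (src_actor_id, dst_actor_id, payload)
        if machine_id_of(dst_actor_id) == self.this_machine_id:
            dst_thread = self.threads[thread_id_of(dst_actor_id)]
            dst_thread.enqueue_actor_msg(msg, thread_id_of(src_actor_id))
        else:
            self.sent_over_network.append(msg)


threads = {0: SketchThread(0), 1: SketchThread(1)}
bus = SketchMsgBus(this_machine_id=0, threads=threads)
a = make_actor_id(0, 0, 1)
b = make_actor_id(0, 0, 2)   # same thread as a
c = make_actor_id(0, 1, 1)   # same machine, other thread
d = make_actor_id(1, 0, 1)   # other machine
for dst in (b, c, d):
    bus.send_msg(a, dst, "regst")
```

Because the destination machine and thread are both recoverable from the actor id, the sender needs no lookup table beyond the id itself.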

The message passing relationship between these objects is shown in the following diagram

806204607cdd45cecd5502258a28a0f2.png

6.4 Activate source actors

In the current implementation, Actors are all self-scheduling and can only receive messages from other Actors. There is a special kind of source actors in Actor, which correspond to source tasks.

Source actors do not have upstream actors, they send messages to downstream actors to activate all actors to run.

How are the source actors themselves executed? After receiving the kStart message, they keep acting until they enter the exit process. However, their kernel blocks on a Buffer ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.1/oneflow/core/common/buffer.h#L26 ) until another thread adds data to the buffer. Once data arrives, the kernel is woken up and performs the read. When the kernel completes, the actor's Act ends and a message is sent downstream.

Source actors must have a separate actor thread because they will block.

The last step of Runtime initialization is to send kStart message to each source actor to activate them, but the source actor will only execute after receiving the buffer data, and then send a message to the downstream actors to make all actors execute.

Actor

7.1 Actor Creation

When Thread creates an Actor, it first tries to create it as a LightActor ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/thread/thread.cpp#L104 ). If that fails, it falls back to creating the Actor through a pre-registered factory.

There are several TaskTypes available for LightActor ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/light_actor.cpp#L677-L689 ):

  • kNormalForward, such as matmul, add and other user ops.

  • kCopyHd

  • kTick

  • kCollectiveBoxingGeneric

There are currently about 20 subtypes of Actor. The other Actor types are pre-registered according to TaskType ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/task.proto#L8 ), for example WaitAndSendIdsActor.
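The "try LightActor first, then fall back to a registered factory" pattern might look roughly like the sketch below. The registry, the decorator and the string keys (including "kWaitAndSendIds") are assumptions for illustration; the real code registers C++ classes by TaskType, and a LightActor may be rejected for reasons other than the task type.

```python
LIGHT_ACTOR_TASK_TYPES = {"kNormalForward", "kCopyHd", "kTick", "kCollectiveBoxingGeneric"}
ACTOR_REGISTRY = {}   # task type -> actor class


def register_actor(task_type):
    """Class decorator that records an actor class under a task type."""
    def decorator(cls):
        ACTOR_REGISTRY[task_type] = cls
        return cls
    return decorator


class LightActor:
    def __init__(self, task_type):
        self.task_type = task_type


@register_actor("kWaitAndSendIds")
class WaitAndSendIdsActor:
    def __init__(self, task_type):
        self.task_type = task_type


def try_new_light_actor(task_type):
    """Return a LightActor if the task type supports it, else None."""
    return LightActor(task_type) if task_type in LIGHT_ACTOR_TASK_TYPES else None


def new_actor(task_type):
    actor = try_new_light_actor(task_type)
    if actor is None:
        # Fall back to the factory that was registered for this task type.
        actor = ACTOR_REGISTRY[task_type](task_type)
    return actor


print(type(new_actor("kTick")).__name__)
print(type(new_actor("kWaitAndSendIds")).__name__)
```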

For the actor types corresponding to each node of the sample code, see the appendix.

Actor-related class relationships are as follows (inclusion relationship only means that related information can be accessed, and does not mean that objects of this type are created or owned)

2a41b458ec9cc328534b4029eee0af92.png

7.2 Actor initialization

The constructor of Actor is generally empty, so the Init function ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L129 ) needs to be executed after construction for initialization.

LightActor inherits from ActorBase, not a subclass of Actor, and has its own Init function implementation. Only Actor initialization is discussed here.

In Actor::Init ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L129 ), ConstructKernel ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L138 ) is called first to create a kernel instance. As with Operator, kernels are registered with OpTypeCase as the key, for example WaitAndSendIdsKernel ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/kernel/wait_and_send_ids_kernel.cpp#L51 ). An Actor usually has only one kernel.

Then NewRegsts is called to create the Regsts ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L152 ). Tensor is a user-side concept; the corresponding runtime concept is Regst ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/register/register.h#L24 ), which holds the memory that the Kernel reads and writes. The concept of Regst is broader than Tensor. For example, the control ops automatically added by the framework also use Regst.

Actor saves the Regst created by itself to produced_regsts_ ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L153 ).

TakeOverNaiveConsumed ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L182 ) only records the regst id that needs to be consumed, but does not push to consumed_regsts_.

TakeOverNaiveProduced ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L183 ) not only records the regst id produced, but also pushes to naive_produced_rs_ ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L249 ). This distinction is made for the actor to execute smoothly when the calculation is performed for the first time. I will come back to discuss it later when I analyze Actor's message processing.

Calling InitBnInOp2BlobInfo initializes the BlobInfo ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L184 ).

Then VirtualActorInit is called ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L185 ), which lets each Actor subclass customize its own initialization logic. It usually calls the OF_SET_MSG_HANDLER macro ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.h#L76-L80 ) to set the Actor's message handler.

7.3 Actor's message processing

LightActor first dispatches on the message type, handling kRegstMsg and kEordMsg separately. HandleRegstMsg ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.1/oneflow/core/lazy/actor/light_actor.cpp#L424 ) updates the various read and write status counts according to the regst type (kProduced or kConsumed).

It then checks whether the read and write counts have reached their thresholds. If so, the conditions for reading and writing the regsts are met, and ActOnce ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.1/oneflow/core/lazy/actor/light_actor.cpp#L451 ) is executed.

The first time LightActor::ActOnce runs, it calls InitBnInOp2Blob and InitActMsg. InitBnInOp2Blob initializes the mapping between bn and the Blob in a regst, so that the kernel can access a Blob through its bn. InitActMsg constructs all the messages that will need to be sent, which avoids rebuilding them every time messages are sent later.

Then there is LaunchKernel, and then ResetState will reset the regst state.

After LaunchKernel, the previously constructed message will be sent out, the synchronous message will be directly queued into the thread message queue, and the asynchronous message will be sent to ActorMsgBus through the callback.

An ordinary Actor's ProcessMsg calls the msg handler to process the message. The most common msg handler is Actor::HandlerNormal ( https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.1/oneflow/core/lazy/actor/actor.cpp#L329 ).

The process in Actor::HandlerNormal is similar to that in LightActor, and it will be processed separately according to different regst types. The state management method of regst in Actor is different from that in LightActor. The method in LightActor is more efficient, and some special cases can be handled in Actor.

After the message is processed, ActUntilFail will be called, and ActUntilFail will judge IsReadReady and IsWriteReady to determine whether the Act can be performed.

The most common NaiveActor::Act() is to execute AsyncLaunchKernel.

After the Act is completed, it starts sending regst messages upstream and downstream.

There are also some special Actors. Let's take WaitAndSendIdsActor as an example to observe the message processing mechanism of such Actors.

This example was chosen for two reasons: first, the Actor is relatively simple; second, it is a typical source task, which shows how the computation graph is triggered to start computing.

If the message received by Thread is not kStopThread or kConstructActor, Actor::ProcessMsg ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/thread/thread.cpp#L83 ) is called and the message is forwarded to the Actor for processing.

The ProcessMsg function simply transfers the message to the handler ( https://github.com/Oneflow-Inc/oneflow/blob/b6bf3f8843679111eb1edf79deefce814d250f4e/oneflow/core/lazy/actor/actor.h#L38 ).

In WaitAndSendIdsActor::VirtualActorInit, the handler is set to HandlerWaitToStart ( https://github.com/Oneflow-Inc/oneflow/blob/22f70a1719f371a54512633bb92086580d9c3c89/oneflow/core/lazy/actor/wait_and_send_ids_actor.cpp#L55 ).

In the constructor of Runtime, the first batch of messages sent is the kStart message to source_tasks, and this message is processed by the HandlerWaitToStart function.

After HandlerWaitToStart checks the message type, it sets the handler to HandlerNormal ( https://github.com/Oneflow-Inc/oneflow/blob/b17a9cd6b930b5817c63623fb682bd708377a93b/oneflow/core/job/runtime.cpp#L109 ) (which is also the default handler) and then calls ProcessMsg ( https://github.com/Oneflow-Inc/oneflow/blob/22f70a1719f371a54512633bb92086580d9c3c89/oneflow/core/lazy/actor/wait_and_send_ids_actor.cpp#L74 ), which now invokes the newly set handler, HandlerNormal.

In HandlerNormal, if it is kCmdMsg, only kStart is allowed ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L377 ). After passing the message type verification, ActUntilFail ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L378 ) will be called directly.
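The handler switch described above is a small state machine, sketched below. SketchSourceActor and its method names are hypothetical, and act_until_fail is reduced to a counter so that the effect is visible.

```python
K_START = "kStart"


class SketchSourceActor:
    def __init__(self):
        # Set during initialization, as VirtualActorInit does.
        self.msg_handler = self.handler_wait_to_start
        self.act_count = 0

    def process_msg(self, msg):
        # ProcessMsg only forwards to whichever handler is currently installed.
        return self.msg_handler(msg)

    def handler_wait_to_start(self, msg):
        assert msg == K_START, "only the start command is expected here"
        self.msg_handler = self.handler_normal
        return self.process_msg(msg)   # re-dispatch through the new handler

    def handler_normal(self, msg):
        self.act_until_fail()
        return 0

    def act_until_fail(self):
        self.act_count += 1            # placeholder for the real Act loop


actor = SketchSourceActor()
actor.process_msg(K_START)
print(actor.act_count)
```

A single kStart message therefore both switches the handler and triggers the first attempt to act.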

7.4 Conditions for Act execution

LightActor and Actor use different strategies to determine whether to perform Act. LightActor is more efficient, and Actor can handle some special cases.

For LightActor, Act can be executed once the count of registers being read, total_reading_cnt_, has returned to 0 and the count of consumable registers, ready_consumed_, has increased to max_ready_consumed_. The former indicates that all consumers have finished reading the Regst of the current LightActor; the latter indicates that every Regst consumed by the current LightActor has arrived (announced by Regst messages sent from upstream).
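The two counters can be modeled in a few lines. This is a sketch under simplifying assumptions: SketchLightActor is a hypothetical class, it has exactly one downstream consumer, and Act is reduced to incrementing a counter.

```python
class SketchLightActor:
    def __init__(self, num_inputs):
        self.max_ready_consumed = num_inputs
        self.ready_consumed = 0      # inputs that have arrived for this round
        self.total_reading_cnt = 0   # consumers still reading our output
        self.acts = 0

    def on_consumed_regst(self):
        """An upstream actor reports that one input register is ready."""
        self.ready_consumed += 1
        self.try_act()

    def on_produced_regst_returned(self):
        """A downstream consumer reports that it has finished reading."""
        self.total_reading_cnt -= 1
        self.try_act()

    def try_act(self):
        if self.total_reading_cnt == 0 and self.ready_consumed == self.max_ready_consumed:
            self.acts += 1
            self.ready_consumed = 0      # reset for the next round
            self.total_reading_cnt = 1   # assumption: a single downstream consumer


actor = SketchLightActor(num_inputs=2)
actor.on_consumed_regst()
actor.on_consumed_regst()            # both inputs ready, nothing being read: first Act
actor.on_consumed_regst()
actor.on_consumed_regst()            # inputs ready, but the output is still being read
acts_while_blocked = actor.acts
actor.on_produced_regst_returned()   # consumer is done: second Act
print(acts_while_blocked, actor.acts)
```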

For Actor, the Act method ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L424 ) called in Actor::ActUntilFail is implemented by each subclass. In general, its main job is to launch the kernel computation.

But before executing Act, you need to confirm:

  • Is the data that Act execution depends on ready? (IsReadReady)

  • Has the consumer used up the data produced by Act and received an ack message for confirmation? (IsWriteReady)

Actor has 4 member variables related to this

  • RegstSlot naive_produced_rs_;

  • RegstSlot inplace_produced_rs_;

  • RegstSlot naive_consumed_rs_;

  • RegstSlot inplace_consumed_rs_;

xx_produced_rs_ stores the used ack regst information returned by the downstream consumer of the current Actor. (Regsts produced by the current Actor are stored in produced_regsts_.)

During runtime initialization, no Actor has run yet, so no Actor can have received an ack message. Therefore xx_produced_rs_ must be pre-filled when the Actor is initialized. This ensures that the Actor is WriteReady before its first run and can start executing smoothly.

xx_consumed_rs_ stores data sent by upstream dependencies. It does not need to be pre-populated. Because source_tasks has no input dependencies, it is naturally ReadReady; and xx_produced_rs_ is pre-populated during initialization to ensure that it is WriteReady, so source_tasks can be run directly. The output message of source_tasks is sent to the downstream, and the downstream will become ReadReady, and the downstream is also guaranteed to be WriteReady after initialization. The entire Actor system can work like this.
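The bootstrapping argument can be checked with a toy model. SketchActor and its two deques are hypothetical stand-ins for the RegstSlot members; readiness is reduced to "enough inputs have arrived" and "at least one writable regst is available".

```python
from collections import deque


class SketchActor:
    def __init__(self, name, num_inputs, num_output_regsts=1):
        self.name = name
        self.num_inputs = num_inputs
        self.consumed_rs = deque()   # regsts received from upstream, initially empty
        # Pre-fill: every produced regst starts out as writable.
        self.produced_rs = deque(f"{name}-regst-{i}" for i in range(num_output_regsts))

    def is_read_ready(self):
        return len(self.consumed_rs) >= self.num_inputs

    def is_write_ready(self):
        return len(self.produced_rs) > 0


source = SketchActor("source", num_inputs=0)
downstream = SketchActor("downstream", num_inputs=1)

# Right after initialization: the source can act, the downstream actor cannot yet.
source_can_act = source.is_read_ready() and source.is_write_ready()
downstream_before = (downstream.is_read_ready(), downstream.is_write_ready())

# The source acts and hands its regst to the downstream actor.
downstream.consumed_rs.append(source.produced_rs.popleft())
downstream_after = (downstream.is_read_ready(), downstream.is_write_ready())
print(source_can_act, downstream_before, downstream_after)
```

After handing over its only regst the source is no longer WriteReady, which is what forces it to wait for an ack before producing again.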

7.5 Notification mechanism between upstream and downstream actors

After the Act is executed, the result data needs to be sent to the downstream consumer. Taking Naive Produced of WaitAndSendIds as an example, the calling process in ActUntilFail is as follows:

  • AsyncSendNaiveProducedRegstMsgToConsumer ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L427 )

    • VirtualAsyncSendNaiveProducedRegstMsgToConsumer ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L441 )

      • HandleProducedNaiveDataRegstToConsumer ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L446 )

        • HandleRegstToConsumer ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L577 )

          • EnqueueAsyncMsg ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L523 )

            • ActorMsgBus::SendMsg if the target thread is the current thread ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L662 )

            • Otherwise, enqueue the message to async_msg_queue_ ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L664 )

          • Increase total_reading_cnt_ ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L526 ) (this variable counts the messages that have been sent downstream but whose acks have not yet been received)

        • naive_produced_rs_.PopFrontRegsts ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L581 )

    • AsyncSendProducedCtrlRegstMsgToConsumer

Note that naive_produced_rs_.PopFrontRegsts ( https://github.com/Oneflow-Inc/oneflow/blob/06a6af1c7f760ba4b12d2dfb8f73d7fda5c7dbab/oneflow/core/lazy/actor/register_slot.cpp#L53 ) removes the Regst pointer from the queue and decrements the count of available registers by 1 ( https://github.com/Oneflow-Inc/oneflow/blob/06a6af1c7f760ba4b12d2dfb8f73d7fda5c7dbab/oneflow/core/lazy/actor/register_slot.cpp#L49 ).

When Actor::HandlerNormal processes a received kRegstMsg message ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L340 ), if it is an ack message sent by a consumer, TryUpdtStateAsProducedRegst ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L355 ) is called and the Regst is added back to naive_produced_rs_ ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L654 ). This ensures that the current Actor is WriteReady after receiving all acks. At the same time, total_reading_cnt_, the count of registers being read, is decremented.

Actors handle messages from the upstream actors they depend on in a similar way. The following function calls send an ack message upstream to signal that the register has been consumed and may be overwritten:

  • AsyncSendNaiveConsumedRegstMsgToProducer ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L431 )

  • AsyncRetInplaceConsumedRegstIfNoConsumer ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L432 )

After a message from upstream is received in Actor::HandlerNormal, the Regst is added to consumed_rs_ ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L344 ), which ensures that the current Actor is ReadReady once all the data it depends on has arrived.

LightActor has its own message processing mechanism ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/light_actor.cpp#L299 ), the general principle should be similar.
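One full round trip of a register between a producer and a consumer can be sketched as follows. SketchProducer and SketchConsumer are hypothetical classes; the deques play the role of naive_produced_rs_ and naive_consumed_rs_, and message passing is replaced by direct method calls.

```python
from collections import deque


class SketchProducer:
    def __init__(self, regsts):
        self.naive_produced_rs = deque(regsts)  # regsts that are free to write
        self.total_reading_cnt = 0              # sent downstream, ack pending

    def is_write_ready(self):
        return len(self.naive_produced_rs) > 0

    def send_to_consumer(self, consumer):
        regst = self.naive_produced_rs.popleft()   # like PopFrontRegsts
        self.total_reading_cnt += 1
        consumer.receive(regst)

    def on_ack(self, regst):
        self.naive_produced_rs.append(regst)       # the regst is writable again
        self.total_reading_cnt -= 1


class SketchConsumer:
    def __init__(self):
        self.naive_consumed_rs = deque()

    def receive(self, regst):
        self.naive_consumed_rs.append(regst)

    def finish_reading(self, producer):
        # Returning the regst plays the role of the ack message.
        producer.on_ack(self.naive_consumed_rs.popleft())


producer, consumer = SketchProducer(["regst-0"]), SketchConsumer()
producer.send_to_consumer(consumer)
blocked = not producer.is_write_ready()   # the only regst is still being read
consumer.finish_reading(producer)
print(blocked, producer.is_write_ready(), producer.total_reading_cnt)
```

With a single regst the producer and consumer strictly alternate; allocating more regsts per producer is what would let them overlap.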

7.6 Actions performed by Act

According to the above discussion, Actor will also enter ActUntilFail execution after receiving kRegstMsg. If both reading and writing are Ready, execute Act ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L424 ). Taking WaitAndSendIdsActor as an example, the main call link is as follows:

  • AsyncLaunchKernel ( https://github.com/Oneflow-Inc/oneflow/blob/22f70a1719f371a54512633bb92086580d9c3c89/oneflow/core/lazy/actor/wait_and_send_ids_actor.cpp#L58 )

  • ek.kernel->Launch ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L562 ), start Kernel calculation

  • Forward ( https://github.com/Oneflow-Inc/oneflow/blob/eae9ff38f074479d79ce24b0f6e0594f82126171/oneflow/core/kernel/kernel.cpp#L52 )

  • ForwardDataContent ( https://github.com/Oneflow-Inc/oneflow/blob/eae9ff38f074479d79ce24b0f6e0594f82126171/oneflow/core/kernel/kernel.cpp#L65 )

  • buffer->Pull ( https://github.com/Oneflow-Inc/oneflow/blob/b17a9cd6b930b5817c63623fb682bd708377a93b/oneflow/core/kernel/wait_and_send_ids_kernel.cpp#L40 )

  • Assign the storage address mut_dptr of regst ( https://github.com/Oneflow-Inc/oneflow/blob/b17a9cd6b930b5817c63623fb682bd708377a93b/oneflow/core/kernel/wait_and_send_ids_kernel.cpp#L47 )


buffer->Pull will wait for the notification of the condition variable ( https://github.com/Oneflow-Inc/oneflow/blob/49f60e682518436dfeb37344a15902a959e0e4f2/oneflow/core/common/buffer.h#L60 ). Now, it looks like all the Actors are ready to go, just waiting for the gun to go off.

Start the calculation of the static graph

Graph.__run ( https://github.com/Oneflow-Inc/oneflow/blob/81edd938826a7ea903174d682348847658b64653/python/oneflow/nn/graph/graph.py#L226 ) pulls the trigger of the starting gun and starts one round of computation of the graph.

The main calling process is as follows:

  • RunLazyNNGraph ( https://github.com/Oneflow-Inc/oneflow/blob/81edd938826a7ea903174d682348847658b64653/python/oneflow/nn/graph/graph.py#L1076 )

  • builder->LaunchLazyJob ( https://github.com/Oneflow-Inc/oneflow/blob/8f672eea116cae4a73bb7309e7496b08d7ec9a32/oneflow/core/framework/nn_graph.cpp#L568 )

  • LaunchLazyJobInstructionType ( https://github.com/Oneflow-Inc/oneflow/blob/8f672eea116cae4a73bb7309e7496b08d7ec9a32/oneflow/core/framework/instructions_builder.cpp#L179 )

  • Buffer::Push ( https://github.com/Oneflow-Inc/oneflow/blob/8f672eea116cae4a73bb7309e7496b08d7ec9a32/oneflow/core/framework/instructions_builder.cpp#L179 )

Buffer::Push here is the starting signal that WaitAndSendIdsKernel is waiting for.
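The Push/Pull handshake can be imitated with a condition variable in Python. This is a sketch, not OneFlow's Buffer: SketchBuffer, the status strings and source_kernel are hypothetical names, and the close method is included only so that the demo thread can terminate.

```python
import threading
from collections import deque

SUCCESS, ERROR_CLOSED = "success", "closed"   # hypothetical status names


class SketchBuffer:
    def __init__(self):
        self.items = deque()
        self.is_closed = False
        self.cond = threading.Condition()

    def push(self, item):
        with self.cond:
            if self.is_closed:
                return ERROR_CLOSED
            self.items.append(item)
            self.cond.notify()
            return SUCCESS

    def pull(self):
        with self.cond:
            # Sleep until there is an item or the buffer has been closed.
            self.cond.wait_for(lambda: self.items or self.is_closed)
            if self.items:
                return SUCCESS, self.items.popleft()
            return ERROR_CLOSED, None

    def close(self):
        with self.cond:
            self.is_closed = True
            self.cond.notify_all()


buffer = SketchBuffer()
results = []


def source_kernel():
    # Keeps pulling job ids until the buffer is closed and drained.
    while True:
        status, job_id = buffer.pull()
        if status == ERROR_CLOSED:
            break
        results.append(job_id)


kernel_thread = threading.Thread(target=source_kernel)
kernel_thread.start()
buffer.push(0)      # plays the role of Buffer::Push for one graph run
buffer.push(0)      # a second run of the same job
buffer.close()
kernel_thread.join()
print(results)
```

Each push wakes the waiting thread exactly once, so one call from the Python side corresponds to one round of the graph.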

run-time exit mechanism

The entire runtime contains many objects and resources, and a safe and orderly exit is a complex and meticulous work. Here we only take WaitAndSendIds as an example to observe the exit mechanism at runtime from one side.

The exit of the runtime starts with the destruction of the NNGraph object ( https://github.com/Oneflow-Inc/oneflow/blob/8f672eea116cae4a73bb7309e7496b08d7ec9a32/oneflow/core/framework/nn_graph.cpp#L76 ).

9.1 Actor exit

  • When NNGraph is destructed, all Buffer objects will be closed ( https://github.com/Oneflow-Inc/oneflow/blob/8f672eea116cae4a73bb7309e7496b08d7ec9a32/oneflow/core/framework/nn_graph.cpp#L82 ).

  • When the Buffer is closed, it will set is_closed_ = true and notify all listeners ( https://github.com/Oneflow-Inc/oneflow/blob/49f60e682518436dfeb37344a15902a959e0e4f2/oneflow/core/common/buffer.h#L81 ). But Pull will continue to process the submitted calculations.

    • Therefore, Buffer should be a class mainly used for in-process communication and asynchronous coordination.

  • WaitAndSendIdsKernel, which has been waiting for a new round of computation to start ( https://github.com/Oneflow-Inc/oneflow/blob/b17a9cd6b930b5817c63623fb682bd708377a93b/oneflow/core/kernel/wait_and_send_ids_kernel.cpp#L40 ), receives the buffer-closed status returned by Pull ( https://github.com/Oneflow-Inc/oneflow/blob/49f60e682518436dfeb37344a15902a959e0e4f2/oneflow/core/common/buffer.h#L61 ).

  • From then on, WaitAndSendIdsActor::IsCustomizedReadReady returns false ( https://github.com/Oneflow-Inc/oneflow/blob/22f70a1719f371a54512633bb92086580d9c3c89/oneflow/core/lazy/actor/wait_and_send_ids_actor.cpp#L68 ), and IsReadReady also returns false ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L533 ).

    • After this, ActUntilFail will only perform asynchronous message sending ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L437 ) (no longer enter the while loop)

  • WaitAndSendIdsActor::HandlerNormal will still process messages from other actors ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L340 ). But because IsCustomizedReadReady returns false, it will enter AsyncSendEORDMsgForAllProducedRegstDesc ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor4cpp ) to execute. It sends a kEordMsg message to each downstream ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L614 ).

  • After receiving the kEordMsg message from the upstream, the actor decrements remaining_eord_cnt_ ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L331 ).

    • remaining_eord_cnt_ is initialized to the number of input regsts for the Actor ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L171 ).

  • total_reading_cnt_ is the number of messages produced by the current actor that have been sent to the consumer but have not yet received ack.

    • Actor can still normally receive the ack message sent by the consumer.

  • When both of the above variables are 0 ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L395 ), all upstream actors have sent kEordMsg and all downstream ack messages have been received. The Actor then returns 1 to Thread ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L397 ).

    • If the two variables are not both 0, the handler is switched to HandlerZombie ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L399 ), which handles the subsequent incoming messages.

  • After Thread receives the 1 returned by the Actor ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/thread/thread.cpp#L84 ), it deletes the Actor from its own storage ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/thread/thread.cpp#L89 ) and decrements the number of running Actors.
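The counting logic behind the exit steps above can be sketched as follows. This is a simplified model: SketchActor is a hypothetical class, messages are plain strings, and the point at which the actor starts shutting down (here, once every upstream kEordMsg has arrived) is an assumption of the sketch.

```python
class SketchActor:
    def __init__(self, num_inputs, pending_acks):
        self.remaining_eord_cnt = num_inputs   # upstream kEordMsg still expected
        self.total_reading_cnt = pending_acks  # produced regsts awaiting an ack
        self.handler = self.handler_normal

    def process_msg(self, msg):
        return self.handler(msg)

    def _count(self, msg):
        if msg == "kEordMsg":
            self.remaining_eord_cnt -= 1
        elif msg == "ack":
            self.total_reading_cnt -= 1

    def _is_finished(self):
        return self.remaining_eord_cnt == 0 and self.total_reading_cnt == 0

    def handler_normal(self, msg):
        self._count(msg)
        if self.remaining_eord_cnt == 0:       # upstream is done: start shutting down
            if self._is_finished():
                return 1                       # tells the thread to drop this actor
            self.handler = self.handler_zombie
        return 0

    def handler_zombie(self, msg):
        # Only waits for the outstanding messages to drain.
        self._count(msg)
        return 1 if self._is_finished() else 0


actor = SketchActor(num_inputs=1, pending_acks=1)
actors = {42: actor}                 # the thread's actor storage
for msg in ["kEordMsg", "ack"]:
    if actors[42].process_msg(msg) == 1:
        del actors[42]               # the thread removes the finished actor
print(actors)
```

In this run the kEordMsg arrives while an ack is still outstanding, so the actor passes through the zombie state before the thread removes it.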

9.2 Thread exit

  • NNGraph resets runtime_ causing the runtime object to be destructed ( https://github.com/Oneflow-Inc/oneflow/blob/8f672eea116cae4a73bb7309e7496b08d7ec9a32/oneflow/core/framework/nn_graph.cpp#L83 ).

  • Runtime deletes all Threads ( https://github.com/Oneflow-Inc/oneflow/blob/b17a9cd6b930b5817c63623fb682bd708377a93b/oneflow/core/job/runtime.cpp#L117 ).

  • ThreadMgr sends a kStopThread message to every Thread ( https://github.com/Oneflow-Inc/oneflow/blob/c8c6d351fa28c5ebce948d69c06670a783f83f74/oneflow/core/thread/thread_manager.cpp#L64 ). It also resets the pointer to each Thread, which triggers the Thread's destruction ( https://github.com/Oneflow-Inc/oneflow/blob/c8c6d351fa28c5ebce948d69c06670a783f83f74/oneflow/core/thread/thread_manager.cpp#L66 ).

  • The physical thread of Thread exits the PollMsgChannel loop ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/thread/thread.cpp#L68 ).

  • Thread waits for the end of the physical thread and closes the channel ( https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/thread/thread.cpp#L52 ).
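
The stop sequence above follows a common pattern: a worker polls a message channel, a stop command breaks the loop, and the owner joins the worker before closing the channel. A minimal sketch of that pattern, using Python's standard queue and threading modules rather than OneFlow's Channel and Thread classes, is shown below. ToyThread, STOP_THREAD, and the channel_closed flag are hypothetical names, and shutdown() merges the ThreadMgr step and the Thread destructor step into one method.

```python
import queue
import threading

STOP_THREAD = "kStopThread"  # stand-in for the stop command; not the real ActorMsg


class ToyThread:
    """Toy worker that polls a message channel until it sees a stop command."""

    def __init__(self):
        self.msg_channel = queue.Queue()
        self.handled = []            # messages processed before the stop command
        self.channel_closed = False
        self._worker = threading.Thread(target=self._poll_msg_channel)
        self._worker.start()

    def _poll_msg_channel(self):
        while True:
            msg = self.msg_channel.get()  # blocks until a message arrives
            if msg == STOP_THREAD:
                break                     # leave the polling loop
            self.handled.append(msg)

    def enqueue(self, msg):
        self.msg_channel.put(msg)

    def shutdown(self):
        # Manager side: ask the worker to stop.
        self.enqueue(STOP_THREAD)
        # Destructor side: wait for the worker, then mark the channel closed.
        self._worker.join()
        self.channel_closed = True
```

Because the queue is first-in first-out, every message enqueued before shutdown() is handled before the worker sees the stop command.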

10 

Static graph of a distributed scene

In a distributed scenario, the compile_job and the physical graph (plan) differ significantly from those of the single-machine scenario.

For example, each process has its own set of control nodes such as WaitAndSendIds. This makes sense, because every process must execute __run and Buffer::Push/Pull, and must start its own Actors to perform the computation.

User ops such as matmul and broadcast_add are also computed on both nodes.
 


10.1 Sample code

For the startup method, refer to the official documentation of Global Tensor.

import oneflow as flow
import oneflow.nn as nn


P0 = flow.placement("cpu", ranks=[0, 1])
a0_sbp = flow.sbp.split(0)


class ModuleMyLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(flow.randn(in_features, out_features,
            placement=P0, sbp=flow.sbp.broadcast))
        self.bias = nn.Parameter(flow.randn(1, out_features,
            placement=P0, sbp=flow.sbp.broadcast))


    def forward(self, input):
        return flow.matmul(input, self.weight) + self.bias


linear_model = ModuleMyLinear(4, 3)


class GraphMyLinear(nn.Graph):
    def __init__(self):
        super().__init__()
        # ModuleBlock
        self.model = linear_model


    def build(self, input):
        # ModuleBlock.__call__
        return self.model(input)


graph_mylinear = GraphMyLinear()
input = flow.randn(5, 4, placement=P0, sbp=flow.sbp.split(1))
out = graph_mylinear(input)
print(out)
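
To make the SBP signatures in this sample more concrete, the following pure-Python sketch shows what a split(1) layout of the 5x4 input means on two ranks, and why the per-rank partial products have to be summed to recover the logical matmul result. It only illustrates the arithmetic. It does not describe which boxing strategy OneFlow's compiler actually chooses for this graph, and all helper names (matmul, add, split_cols, split_rows, sharded_matmul) are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split_cols(m, parts):
    """Split along dim 1 into equal column blocks (assumes even division)."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split along dim 0 into equal row blocks (assumes even division)."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def sharded_matmul(x, w, parts):
    """Multiply column shards of x by matching row blocks of w, then sum."""
    x_shards = split_cols(x, parts)  # each rank holds a 5x2 shard under split(1)
    w_blocks = split_rows(w, parts)  # rows of w that match each shard's columns
    partials = [matmul(xs, wb) for xs, wb in zip(x_shards, w_blocks)]
    total = partials[0]
    for p in partials[1:]:
        total = add(total, p)        # partial sums must be reduced across ranks
    return total


x = [[r * 4 + c for c in range(4)] for r in range(5)]    # logical 5x4 input
w = [[(r + c) % 3 for c in range(3)] for r in range(4)]  # logical 4x3 weight
assert sharded_matmul(x, w, 2) == matmul(x, w)
```

Each rank's local product has the full 5x3 output shape but holds only part of the sum, which is why some form of cross-rank communication is needed before the bias can be added to the logical result.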

11 

Appendix

11.1 Breakpoints

11.1.1 Python breakpoint example

# python3 -m pdb test.py
break test.py:25
break oneflow/nn/graph/graph.py:221
break oneflow/nn/graph/graph.py:741
break oneflow/nn/graph/graph.py:745
break oneflow/nn/graph/graph.py:759
break oneflow/nn/graph/graph.py:828
break oneflow/nn/graph/graph.py:777
break oneflow/nn/graph/graph.py:1066
break oneflow/nn/graph/graph.py:1133
break oneflow/framework/graph_build_util.py:227

11.1.2 C++ breakpoint example

start command

source /mnt/oneflow/build/source.sh
gdb --args python3 /mnt/oneflow/test.py
# set breakpoints
# run

breakpoint example

set breakpoint pending on
break oneflow::ActorMsg::BuildEordMsg
break oneflow/core/common/buffer.h:80
break oneflow::(anonymous namespace)::CheckAndConstructOp
break oneflow::WaitAndSendIdsActor::Act
break oneflow::WaitAndSendIdsActor::HandlerWaitToStart
break oneflow/core/lazy/actor/light_actor.cpp:452
break oneflow/core/lazy/actor/light_actor.cpp:485
break oneflow::ForeignInputKernel::ForwardDataContent
break oneflow::vm::LaunchLazyJobInstructionType::Compute

11.2 JSON representation of static graph

  • forward (https://quip.com/OMc4A0HOOr0C)

  • full (https://quip.com/JLaMAHGBLXmK)

  • compiled (https://quip.com/tXjuAiS3J0Ab)

  • plan (https://quip.com/a0DMAAIte6PQ)

11.3 Actor types

naive_actor

System-AutoTick-AppendDeviceTick_9
System-AutoTick-DstSubsetTick_12
System-AutoTick-DstSubsetTick_21
System-AutoTick-DstSubsetTick_27
System-AutoTick-Prepend-DeviceTick_7
System-AutoTick-SrcSubsetTick_20
System-AutoTick-SrcSubsetTick_26
System-AutoTick-SrcSubsetTick_8
System-AutoTick-Tick_11
System-AutoTick-Tick_13
System-EagerCriticalSection-Callback-23
System-EagerCriticalSection-Callback-29
System-EagerCriticalSection-Interface-Begin-Tick-18
System-EagerCriticalSection-Interface-Begin-Tick-24
System-EagerCriticalSection-Interface-End-Tick-19
System-EagerCriticalSection-Interface-End-Tick-25
System-EagerCriticalSection-Wait-22
System-EagerCriticalSection-Wait-28

light_actor

_GraphMyLinear_0_input.0.0_2
_GraphMyLinear_0_output.0.0_2
model.bias
model-broadcast_add-1
model-matmul-0
model.weight
System-AutoTick-SinkTick_15
System-SyncAllRanksSinkTick_14

wait_and_send_ids_actor

System-Src-WaitAndSendIds_16

call_back_notify_actor

System-Sink-CallbackNotify_17

12 

References

  • OneFlow v0.8.0 (https://github.com/Oneflow-Inc/oneflow/tree/release/v0.8.0)

  • System Design of OneFlow Framework (Part 1) (https://zhuanlan.zhihu.com/p/337851255)

  • System Design of OneFlow Framework (Part 2) (https://zhuanlan.zhihu.com/p/338699487)

  • System Design of OneFlow Framework (Part 3) (https://zhuanlan.zhihu.com/p/339208452)

  • The Execution Process of a Job in OneFlow (Part 1) (https://zhuanlan.zhihu.com/p/344531540)

  • The Execution Process of a Job in OneFlow (Part 2) (https://zhuanlan.zhihu.com/p/355654002)

  • The Execution Process of a Job in OneFlow (Part 3) (https://zhuanlan.zhihu.com/p/363689736)

  • Static graph module nn.Graph (https://docs.oneflow.org/master/basics/08_nn_graph.html)

  • OneFlow system design (https://docs.oneflow.org/v0.4.0/basics_topics/essentials_of_oneflow.html)

  • torch.nn.Module (https://pytorch.org/docs/1.10/generated/torch.nn.Module.html)

Origin blog.csdn.net/OneFlow_Official/article/details/128556815