OneFlow source code analysis: SBP Signature derivation in Eager mode


Author|Jianhua Zheng
Update|Luyang Zhao


OneFlow's Global Tensor has two required properties:

  • Placement: determines which devices the tensor's data is distributed on.

  • SBP: determines how the tensor's data is distributed across these devices. For example:

    • split: the tensor is divided along a specified axis, and the resulting parts are placed on different devices.

    • broadcast: the full data is copied to each device.

If the tensors involved in an operation have different SBPs, what is the SBP of the resulting tensor? Consider the following code:

# export MASTER_ADDR=127.0.0.1 MASTER_PORT=17789 WORLD_SIZE=2 RANK=0 LOCAL_RANK=0
# export MASTER_ADDR=127.0.0.1 MASTER_PORT=17789 WORLD_SIZE=2 RANK=1 LOCAL_RANK=1
import oneflow as flow


P0 = flow.placement("cpu", ranks=[0, 1])


t1 = flow.Tensor([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]], placement=P0, sbp=flow.sbp.split(0))
# t1 = flow.Tensor([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]], placement=P0, sbp=flow.sbp.broadcast)
t2 = flow.Tensor([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]], placement=P0, sbp=flow.sbp.split(1))
t3 = t1 + t2
# oneflow.placement(type="cpu", ranks=[0, 1])
print(t3.placement)
# (oneflow.sbp.split(dim=0),)
print(t3.sbp)

t1 and t2 are two tensors distributed on the same devices. t1.sbp is S(0), split along rows; t2.sbp is S(1), split along columns.


The SBP of the result t3 does not need to be specified manually; the system automatically derives t3.sbp as S(0). A core step in this process is the derivation of the SBP Signature.

1 SBP-related concepts

1.1 SBP

SBP is a concept unique to OneFlow. It describes the mapping between a tensor's logical data and the tensor data actually stored on the physical devices of a cluster. The following content refers to the official SBP documentation (https://docs.oneflow.org/master/parallelism/02_sbp.html#sbp):

In detail:

  • split means the Tensor on each physical device is obtained by slicing the global-view Tensor along a specified dimension. Concatenating the per-device Tensors along that dimension restores the global-view Tensor.

  • broadcast means the global-view Tensor is copied in full to every physical device.

  • partial means the Tensor on each physical device has the same shape as the global-view Tensor, but each device holds only part of the values. Taking partial_sum as an example, adding the tensors of all devices element-wise restores the global-view Tensor. Besides sum, operations such as min and max also work with partial.

The following figure illustrates each SBP case: split(0), split(1), broadcast, and partial_sum.

[Figure: the four SBP cases: split(0), split(1), broadcast, and partial_sum]
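The three mappings can also be illustrated with a minimal pure-Python sketch (an illustration of the semantics only, not OneFlow's implementation):

```python
# Global-view tensor: 2 rows x 4 columns, to be distributed over 2 devices.
global_t = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

# split(0): each device holds one row-slice; concatenation restores the global view.
dev0, dev1 = [global_t[0]], [global_t[1]]
assert dev0 + dev1 == global_t

# broadcast: every device holds a full copy of the global tensor.
copies = [list(map(list, global_t)) for _ in range(2)]
assert all(c == global_t for c in copies)

# partial_sum: each device holds a same-shape tensor; element-wise
# addition across devices restores the global view.
part0 = [[v * 0.25 for v in row] for row in global_t]
part1 = [[v * 0.75 for v in row] for row in global_t]
restored = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(part0, part1)]
assert restored == global_t
```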

1.2 SBP Signature

SBP Signature is an original and important concept in OneFlow. The following points in this section are taken from the official SBP Signature documentation:

  • For an isolated Tensor, we can set its SBP attribute freely. However, for an operator with inputs and outputs, we cannot set the input and output SBP attributes arbitrarily, because an arbitrary combination may not be consistent with the operator's algorithm from the global perspective.

  • For an operator, a specific, legal combination of the SBP attributes of its inputs and outputs is called an SBP Signature of that operator.

  • Based on the operator's algorithm, the operator's author enumerates and registers all possible SBP Signatures when developing the operator.

  • As long as an operator's inputs have SBP attributes, OneFlow can deduce the SBP attributes of its outputs according to the SBP Signature.

  • Automatic SBP Signature derivation means: given all legal SBP Signatures of all operators, OneFlow scores each legal SBP Signature based on its transmission cost and selects the one with the smallest cost. This maximizes the throughput of the system.

  • What if the SBP Signature automatically selected by OneFlow for one layer does not match the output SBP of the upstream operator or the input SBP of the downstream operator? OneFlow detects the inconsistency and inserts an operator between the upstream output and the downstream input to do the conversion. Operators automatically added for such conversions are called Boxing operators.

To sum up, the main points of SBP Signature are:

  • Each operator needs corresponding SBP signatures to describe how its data (tensors) are distributed.

  • An SBP signature covers all inputs and outputs of the operator. A subset of the inputs, or a subset of the outputs, alone does not constitute a signature.

    • Accordingly, SbpSignature.bn_in_op2sbp_parallel is a map whose keys identify each input and output.

  • The combination of input and output SBPs must be legal under the operator's algorithm, and the operator's author needs to enumerate the candidate set of legal signatures.

  • If the SBP of an input tensor is inconsistent with the operator's legal signatures, then in order to obtain the data the operator needs for correct computation, OneFlow inserts a boxing operator between the upstream output and the downstream input (which may involve collective communication such as NCCL) to do the conversion. This automatic conversion process is called Boxing. For example, the interpreter in eager global mode completes the Boxing process in the GetBoxingOutput method.
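As a mental model (hypothetical Python, not OneFlow's actual C++ types), a signature can be viewed as a map from blob name to SBP, mirroring the structure of SbpSignature.bn_in_op2sbp_parallel:

```python
# Hypothetical sketch: SBP signatures for a two-input add operator,
# modeled as maps from blob name (bn) to an SBP descriptor.
sig_split0 = {
    "in_0": ("split", 0),
    "in_1": ("split", 0),
    "out_0": ("split", 0),
}
sig_broadcast = {
    "in_0": ("broadcast", None),
    "in_1": ("broadcast", None),
    "out_0": ("broadcast", None),
}

def matches(sig, input_sbps):
    """Check whether the inputs' actual SBPs agree with a signature."""
    return all(sig[bn] == sbp for bn, sbp in input_sbps.items())

# Inputs both split(0): the split(0) signature matches directly;
# choosing the broadcast signature instead would require Boxing.
inputs = {"in_0": ("split", 0), "in_1": ("split", 0)}
print(matches(sig_split0, inputs))     # True
print(matches(sig_broadcast, inputs))  # False
```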

1.3 NdSbp and NdSbpSignature

In Section 1.1 above, we learned that SBP describes the mapping between a logical tensor and its counterparts on physical devices. What does 2D, or even ND, SBP in OneFlow mean?

A simple way to understand it: ordinary SBP (1D SBP) can only split tensors in a relatively coarse-grained manner. For example, split(0) splits along the 0th dimension of the tensor. If, on top of this, you want finer-grained partitioning, such as making another "cut" along the 1st dimension, ordinary 1D SBP cannot express it, so 2D (or ND) SBP is needed.

The following text mainly refers to the official document 2D SBP.

We can specify that a tensor's data is distributed on 4 devices via ranks=[0, 1, 2, 3]. These 4 devices form a one-dimensional device array, and the corresponding SBP, such as split(1), is a single value, i.e. 1D SBP.

The distribution of the Tensor's data can also be specified as ranks=[[0, 1], [2, 3]]: the four devices are arranged as a 2x2 device matrix. The SBP must then correspond to it, as an array of length 2; accordingly, the NdSbp.sbp_parallel type is an array.

For example sbp = (broadcast, split(0)). The meaning of this 2D SBP is:

  • Perform broadcast along the first dimension of ranks, copying the data to group 0 (ranks [0, 1]) and group 1 (ranks [2, 3]).

  • Perform split(0) separately along the second dimension of ranks.

    • For example, for group 0, the data assigned to it in the previous step is split by row into (1, 2) and (3, 4), given to device 0 and device 1 respectively.

The schematic diagram is as follows:

[Figure: 2D SBP (broadcast, split(0)) distribution on ranks [[0, 1], [2, 3]]]

If the device layout of the Tensor is multi-dimensional, such as [[0, 1], [2, 3]], the operator's SBP Signature is also multi-dimensional; hence in NdSbpSignature, the sbp_parallel corresponding to each input/output is an array.
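The two-step distribution described above can be simulated in plain Python (an illustration of the semantics only; device IDs follow the ranks=[[0, 1], [2, 3]] layout):

```python
# Simulate 2D SBP (broadcast, split(0)) on ranks = [[0, 1], [2, 3]].
data = [[1, 2], [3, 4]]   # global-view tensor, rows (1, 2) and (3, 4)
ranks = [[0, 1], [2, 3]]

# Dim 0 of ranks: broadcast -> each group gets a full copy.
groups = [data for _ in ranks]

# Dim 1 of ranks: split(0) within each group -> one row per device.
per_device = {}
for group_ranks, group_data in zip(ranks, groups):
    rows = [[row] for row in group_data]          # split along axis 0
    for rank, shard in zip(group_ranks, rows):
        per_device[rank] = shard

print(per_device)
# {0: [[1, 2]], 1: [[3, 4]], 2: [[1, 2]], 3: [[3, 4]]}
```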

2 placement.hierarchy

The C++ type corresponding to placement is ParallelDesc. The ranks used to construct a placement can be a multidimensional array, representing a multidimensional device matrix.

placement.hierarchy describes the shape of the ranks array on the placement (much as shape describes the layout of tensor data); it stores the size of ranks in each dimension.

  • The length of the hierarchy array is the number of dimensions of ranks.

  • Each element of hierarchy is the size of the corresponding dimension of ranks.

  • The C++ code for constructing the hierarchy is in GetRanksShape.

Run the following code to observe the value of hierarchy.

import oneflow as flow


placements = [
    flow.placement("cpu", ranks=[ 0, 1, 2,   3, 4, 5]),
    flow.placement("cpu", ranks=[[0, 1, 2], [3, 4, 5]]),
]
for p in placements:
    print(p.hierarchy)
# outputs:
# [6]
# [2, 3]
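The hierarchy is essentially the shape of the nested ranks list. A hypothetical Python analogue of GetRanksShape (the name and details here are an assumption for illustration, not OneFlow's code):

```python
def ranks_shape(ranks):
    """Return the shape of a uniformly nested ranks list,
    analogous to what placement.hierarchy reports."""
    shape = []
    node = ranks
    while isinstance(node, list):
        shape.append(len(node))
        node = node[0]
    return shape

print(ranks_shape([0, 1, 2, 3, 4, 5]))      # [6]
print(ranks_shape([[0, 1, 2], [3, 4, 5]]))  # [2, 3]
```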

3 Which operator is tensor add?

To improve performance, starting from v0.8.0 most Tensor interfaces are provided to Python through the C API.

Many Tensor methods are defined in PyTensorObject_methods. The add method, however, is implemented through the number protocol of the Python C API: PyTensorObject_nb_add handles the addition operation, which in turn calls functional::add.

functional::add is defined in functional_api.yaml.pybind.cpp, a file generated automatically at build time. Following this trail, it is easy to find that the sample code corresponds to AddFunctor. The Op name is "add_n", and the automatically generated file op_generated.cpp defines AddNOp as the Op for add_n. Several methods of AddNOp defined in add_n_op.cpp are used in the SBP Signature derivation process.

4 The derivation process of one-dimensional SBP

The class relationships involved in SBP Signature derivation are as follows:

[Figure: class relationships in SBP Signature derivation]

When the tensor add in the sample code (t1 + t2) executes and the Interpreter calls GetOrInfer, the SBP Signature is deduced. GlobalTensorInferCache::GetOrInfer caches the derivation result, keyed by GlobalTensorMetaInferArgs, so the same derivation does not have to be repeated every time.

The hash function of GlobalTensorMetaInferArgs mainly depends on the following information of the input tensor:

  • shape

  • dtype

  • nd_sbp

  • placement

  • consumer_nd_sbp_constraint

    Different tensor objects can reuse the same derivation result as long as the meta information is the same.

UserOpExpr holds all derived results through GlobalTensorInferCache.

4.1 Derivation preparation in GlobalTensorInferCache

The actual derivation is done in GlobalTensorInferCache::Infer.

4.1.1 Deriving the shape and dtype of output

user_op_expr.InferLogicalTensorDesc mainly derives the shape and data_type of the outputs, saving the results to output_mut_metas. This involves interaction between the UserOpExpr and Op modules; some of the interfaces between modules are summarized later.

The two function objects used by user_op_expr.InferLogicalTensorDesc are defined by the Op and registered in OpRegistry; the function objects in OpRegistryResult come from the Op registry. The Op corresponding to tensor add in the sample code is AddNOp.

An example of the actual call sequence for the AddNOp scenario is as follows:

  • user_op_expr.InferLogicalTensorDesc

    • logical_tensor_desc_infer_fn_ -> AddNOp::InferLogicalTensorDesc

      • out.shape = in[0].shape

    • dtype_infer_fn_ -> AddNOp::InferDataType

      • out.data_type = in[0].data_type
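The effect of this step can be sketched in Python (a hypothetical model of AddNOp-style meta inference, not the C++ implementation): for an n-ary add, every input must share one shape and dtype, and the output simply copies them from in[0].

```python
def infer_add_n_meta(input_metas):
    """Hypothetical sketch: derive the output's (shape, dtype) for an
    elementwise add from a list of (shape, dtype) input metas."""
    shape0, dtype0 = input_metas[0]
    for shape, dtype in input_metas[1:]:
        assert shape == shape0, "add_n inputs must have the same shape"
        assert dtype == dtype0, "add_n inputs must have the same dtype"
    # out.shape = in[0].shape, out.data_type = in[0].data_type
    return {"shape": shape0, "dtype": dtype0}

out = infer_add_n_meta([((2, 4), "float32"), ((2, 4), "float32")])
print(out)  # {'shape': (2, 4), 'dtype': 'float32'}
```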

4.1.2 Construct UserOp

MakeOp(user_op_expr...) returns an Operator whose concrete type is UserOp (see the earlier discussion of static graphs). This object performs the actual derivation.

CheckInputParallelDescIdentical requires the placements of all inputs to be identical. The derivation here is for a UserOp such as tensor add or matmul: when all operands are on the same devices, the operation can be computed directly; otherwise, the data must first be moved together by system Ops before computing.

Since the placement of all inputs is the same, use the first one as the representative and assign it to UserOp to save.

The function of op->InferParallelSignatureIf() is to fill the placement into op.bn2parallel_desc_. For AddNOp, the keys are in_0, in_1, and out_0, and the value is inputs[0].placement.

The infer_args.MakeInputBlobDescs operation is expressed in pseudocode as follows:

# for each input index i
blob_descs[i].shape = inputs[i].shape
blob_descs[i].stride = inputs[i].stride
blob_descs[i].data_type = inputs[i].data_type

The infer_args.MakeNdSbpInferHints operation is expressed in pseudocode as follows:

# for each input index i
hints[i].parallel_desc = inputs[i].parallel_desc
hints[i].blob_desc = blob_descs[i]
hints[i].nd_sbp = inputs[i].nd_sbp

The blob_descs are used to construct pd_infer_hints, and pd_infer_hints are used to construct NdSbpInferHint4Ibn, which encapsulates the relevant information into a function object. This function object is passed to UserOp for the deduction: inside UserOp, given an input/output identifier bn (blob name), the function object yields the corresponding NdSbpInferHint, from which the meta information above can be obtained.
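The NdSbpInferHint4Ibn pattern can be sketched as a Python closure (hypothetical names and field layout for illustration, not OneFlow's code): meta information is captured once, and the operator queries it by blob name instead of seeing the whole input list.

```python
def make_hint4ibn(inputs):
    """Hypothetical sketch of NdSbpInferHint4Ibn: capture per-input
    meta info and expose it via a lookup-by-blob-name closure."""
    hints = {
        bn: {"shape": t["shape"], "nd_sbp": t["nd_sbp"], "placement": t["placement"]}
        for bn, t in inputs.items()
    }
    def hint4ibn(bn):
        return hints[bn]
    return hint4ibn

hint4ibn = make_hint4ibn({
    "in_0": {"shape": (2, 4), "nd_sbp": ("split", 0), "placement": "cpu:0-1"},
    "in_1": {"shape": (2, 4), "nd_sbp": ("split", 1), "placement": "cpu:0-1"},
})
print(hint4ibn("in_0")["nd_sbp"])  # ('split', 0)
```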

After UserOp finishes the deduction, GlobalTensorInferCache saves the meta information of the inputs/outputs, together with the deduced NdSbp, into GlobalTensorInferResult.

4.2 Derivation preparation in Operator

In Operator::InferNdSbpSignatureIf, InferNdSbpSignature is called for the actual derivation, and FillNdSbpSignature then saves the result.

InferNdSbpSignature is a virtual function. UserOp first checks whether the Op defines its own SBP Signature derivation function; AddNOp does not, so Operator::InferNdSbpSignature is called.

InferNdSbpSignature will judge whether it is 1D SBP or ND SBP according to parallel_desc.hierarchy().


Let's look at the 1D SBP case first. The NdSbpInferHint4Ibn function object passed in is called to find the NdSbpInferHint created in GlobalTensorInferCache, which is converted to a SbpInferHint and stored in a map. Because it is one-dimensional, only the first element of sbp_parallel needs to be taken. Then InferSbpSignature (note: no "Nd" in the name) is called, and the derivation result is written to an SbpSignature.


Whether one-dimensional or multi-dimensional, the result type is NdSbpSignature, so the SbpSignature is then converted to an NdSbpSignature.

The main job of Operator::InferSbpSignature is to construct two function objects, SbpInferHint4Ibn and CalcOrderValue4SbpSig, and then call the subclass's overridden virtual function of the same name, InferSbpSignature.


SbpInferHint4Ibn wraps the map built above into a function object used to query the meta information of inputs and outputs.


CalcOrderValue4SbpSig computes an order value for each SbpSignature, which is used to sort the signatures.

InferSbpSignature is also a virtual function. Because AddNOp does not define its own signature derivation function, Operator::InferSbpSignature is called.

4.3 Derivation of SbpSignature

All the preparation is now done; the real derivation happens in Operator::InferSbpSignature. In short, there are 3 steps:

  • Get the candidate set

  • Filter out inappropriate signatures

  • Sort

4.3.1 Candidate set for SbpSignature

Calling GetValidNdSbpSignatureList yields the candidate set of SbpSignatures. This function first calls GetNdSbpSignatureList to obtain a preliminary candidate set, then filters it through FilterNdSbpSignatureListByLogicalShape to get the valid, usable candidates. The candidate set is saved to sbp_sig_list.

GetNdSbpSignatureList is a virtual function, and UserOp implements its own version. Its core operation is val_->get_nd_sbp_list_fn, which ultimately calls AddNOp::GetSbp. UserOpSbpContext is part of the protocol interface between classes such as UserOp and AddNOp.

As mentioned above, providing the candidate set of SBP Signatures is the operator's responsibility. AddNOp is relatively simple and registers only two kinds of signatures:

  • For each axis i of the input tensor's shape, a signature with split(i) for all inputs/outputs.

    • For tensor add, the inputs and output must have the same shape to be computed directly, so the split axis is the same for all of them.

  • A signature with partial_sum for all inputs/outputs.

  • The broadcast case is added by default in the Operator, because in theory all inputs/outputs support computation in broadcast form.

An example of candidate set data is as follows:

{
  "sbp_signature": [
    {"bn_in_op2sbp_parallel": {
      "in_0": {"split_parallel": {"axis": "0"}},
      "in_1": {"split_parallel": {"axis": "0"}},
      "out_0": {"split_parallel": {"axis": "0"}}}},
    {"bn_in_op2sbp_parallel": {
      "in_0": {"split_parallel": {"axis": "1"}},
      "in_1": {"split_parallel": {"axis": "1"}},
      "out_0": {"split_parallel": {"axis": "1"}}}},
    {"bn_in_op2sbp_parallel": {
      "in_0": {"partial_sum_parallel": {}},
      "in_1": {"partial_sum_parallel": {}},
      "out_0": {"partial_sum_parallel": {}}}},
    {"bn_in_op2sbp_parallel": {
      "in_0": {"broadcast_parallel": {}},
      "in_1": {"broadcast_parallel": {}},
      "out_0": {"broadcast_parallel": {}}}}
  ]
}
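Candidate generation for an add-like op can be sketched in Python (a hypothetical model of AddNOp::GetSbp plus the Operator's default broadcast signature, not the actual C++ code):

```python
def add_n_candidates(ndim, bns=("in_0", "in_1", "out_0")):
    """Generate candidate SBP signatures for an elementwise add:
    split(i) for every axis, one partial_sum, one broadcast."""
    sigs = [{bn: ("split", axis) for bn in bns} for axis in range(ndim)]
    sigs.append({bn: ("partial_sum", None) for bn in bns})
    sigs.append({bn: ("broadcast", None) for bn in bns})  # Operator default
    return sigs

for sig in add_n_candidates(ndim=2):
    print(sig)
# 4 candidates: split(0), split(1), partial_sum, broadcast
```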

4.3.2 Filtering inappropriate signatures

Inappropriate signatures are filtered in two steps:

  • In FilterAndCheckValidSbpSignatureListByLogicalShape, for each input tensor ibn, the split axis of ibn in the signature must be less than the number of axes of that tensor's shape. In other words, for a two-dimensional tensor, split(2) is not acceptable, only split(0) or split(1).

  • FilterSbpSignatureList checks the sbp_sig_conf constraint, which is the nd_sbp_constraints parameter passed down from GlobalTensorInferCache. This rule requires that a qualifying signature must contain sbp_sig_conf.
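The two filtering rules can be sketched as follows (hypothetical Python model for illustration; the real checks live in the two C++ functions above):

```python
def filter_signatures(sigs, ndims, constraint=None):
    """Drop signatures whose split axis is out of range for any input,
    then keep only signatures containing the constraint (sbp_sig_conf)."""
    def axis_ok(sig):
        return all(
            sbp[0] != "split" or sbp[1] < ndims[bn]
            for bn, sbp in sig.items() if bn in ndims
        )
    def contains(sig):
        return constraint is None or all(
            sig.get(bn) == sbp for bn, sbp in constraint.items()
        )
    return [s for s in sigs if axis_ok(s) and contains(s)]

sigs = [
    {"in_0": ("split", 0), "in_1": ("split", 0), "out_0": ("split", 0)},
    {"in_0": ("split", 2), "in_1": ("split", 2), "out_0": ("split", 2)},
]
ndims = {"in_0": 2, "in_1": 2}  # both inputs are 2-dimensional
print(filter_signatures(sigs, ndims))  # only the split(0) signature survives
```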

4.3.3 Signature Sorting

SortSbpSignatureListByCopyCost sorts the candidate signatures:

  • Compare by OrderValue first.

  • When OrderValues are equal, compare by CopyCost.

In both comparisons, the smaller value takes precedence.

OrderValue4SbpSig is a wrapper around CalcOrderValue4SbpSig: it pre-computes the OrderValue of every signature and stores them in a map for the sort function to look up. IbnCopyCost4SbpSig works the same way.

Looking back at the definition of CalcOrderValue4SbpSig: because AddNOp has inputs, a weight is added for each input tensor ibn. When the sbp of ibn is the same as the corresponding sbp in the signature, the weight is -10, which increases that signature's chance of being selected, because matching SBPs usually mean no data movement is needed. The parallel_num condition should hold under UserOp.

When sbp_sig_conf is not empty, CalcOrderValue4SbpSig returns 0 directly: if a signature does not contain sbp_sig_conf, then even with matching SBPs it may not meet the requirement, so no order-value preference is applied.

The copy cost of a signature is computed by ComputeIbnCopyCost4SbpSig, mainly from each input's actual sbp and the signature's sbp:

  • If the two sbps are identical, the cost is 0.

  • For partial_sum and broadcast, the cost is a huge number.

  • Otherwise, the cost equals the number of bytes of the input tensor that would have to be transferred.
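The selection logic can be summarized in a small sketch (hypothetical constants and names; the real implementation lives in Operator::InferSbpSignature and its helpers):

```python
# Hypothetical sketch of signature scoring: order value rewards
# signatures matching the inputs' current SBP; copy cost models
# the bytes that would have to move otherwise.
HUGE = 10**18

def order_value(sig, input_sbps):
    # -10 per input whose current sbp matches the signature
    return sum(-10 for bn, sbp in input_sbps.items() if sig.get(bn) == sbp)

def copy_cost(sig, input_sbps, nbytes):
    cost = 0
    for bn, sbp in input_sbps.items():
        if sig[bn] == sbp:
            continue                          # no movement needed
        elif sig[bn][0] in ("partial_sum", "broadcast"):
            cost += HUGE                      # conversions into P/B are expensive
        else:
            cost += nbytes[bn]                # re-shard: move the tensor's bytes
    return cost

inputs = {"in_0": ("split", 0), "in_1": ("split", 1)}
nbytes = {"in_0": 32, "in_1": 32}
sigs = [
    {"in_0": ("split", 0), "in_1": ("split", 0), "out_0": ("split", 0)},
    {"in_0": ("broadcast", None), "in_1": ("broadcast", None), "out_0": ("broadcast", None)},
]
best = min(sigs, key=lambda s: (order_value(s, inputs), copy_cost(s, inputs, nbytes)))
print(best["out_0"])  # ('split', 0)
```

This toy example reproduces the result of the sample code in Section 4.4: with inputs S(0) and S(1), the S(0) signature wins because one input already matches.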

4.4 Derivation results

The derived nd_sbp_signature is as follows:

{
  "bn_in_op2nd_sbp": {
    "in_0": {"sbp_parallel": [{"split_parallel": {"axis": "0"}}]},
    "in_1": {"sbp_parallel": [{"split_parallel": {"axis": "0"}}]},
    "out_0": {"sbp_parallel": [{"split_parallel": {"axis": "0"}}]}
  }
}

In the sample code, if one input is split and the other is broadcast, the deduced signature makes everything broadcast. If the deduced signature were split instead, could data movement be reduced?

5 The derivation process of NdSbp

The derivation of NdSbp mainly consists of 3 steps:

  • Call GetValidNdSbpSignatureList to get valid signatures

  • Eliminate signatures that do not contain nd_sbp_constraints

  • Greedy search for better signatures

Let's focus on obtaining the valid signatures. There are two main steps:

  • GetNdSbpSignatureList: Get all signatures

  • FilterNdSbpSignatureListByLogicalShape: Filter inappropriate signatures

5.1 Candidate Set for NdSbp Signatures

The core of GetNdSbpSignatureList is two steps:

  • GetSbpSignaturesIf: Get one-dimensional signatures (same as 1D SBP)

  • DfsGetNdSbpSignature: Expand to multi-dimensional based on one-dimensional signature

If you go deep into the data details, this process involves multiple dimensions (inputs/outputs, ranks, NdSbp, etc.) and is somewhat abstract and complicated. It is easier to understand if you start from the physical meaning of ranks and NdSbp described in the official 2D SBP document.

Take ranks=[[0, 1, 2], [3, 4, 5]] as an example (ranks=[r1, r2], where r1 and r2 are the two sub-groups).

This is a 2D device matrix. Each input and output of the operator accordingly has two sbp values, and each value in NdSbpSignature is two-dimensional, with two slots. Suppose the Op has n 1D SbpSignatures.


Formally, NdSbpSignature organizes data by bn first. From the perspective of the data distribution process, however, the data is organized by SbpSignature first: an NdSbpSignature is equivalent to an array of SbpSignatures, and each slot of the NdSbp represents one 1D Sbp data distribution (all inputs/outputs distributed together).

  • For example, slot 0 distributes the data between the two sub-groups r1 and r2. This distribution must be a valid 1D SbpSignature (all inputs/outputs distributed together).

  • Slot 1 then distributes, within each sub-group (take r1 as an example), the subset of data assigned to it in slot 0, again according to some 1D SbpSignature (all inputs/outputs distributed together).


Therefore, we only need to fill the two slots, each with one whole SbpSignature. Each slot has n possibilities, giving n*n candidate signatures in total. A candidate set generated this way is complete and misses no candidates. This is the meaning of the direct product of 1D sbp signatures.
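The expansion from 1D signatures to ND candidates can be sketched with a direct product (a hypothetical model of DfsGetNdSbpSignature's effect, not its actual recursive implementation):

```python
import itertools

# 1D signatures of a two-input add (from Section 4.3.1), abbreviated.
sigs_1d = [
    {"in_0": "S(0)", "in_1": "S(0)", "out_0": "S(0)"},
    {"in_0": "S(1)", "in_1": "S(1)", "out_0": "S(1)"},
    {"in_0": "P",    "in_1": "P",    "out_0": "P"},
    {"in_0": "B",    "in_1": "B",    "out_0": "B"},
]

def nd_candidates(sigs, num_slots):
    """Direct product: pick one whole 1D signature per slot, then
    regroup per blob name into an nd-sbp tuple."""
    out = []
    for combo in itertools.product(sigs, repeat=num_slots):
        out.append({bn: tuple(s[bn] for s in combo) for bn in combo[0]})
    return out

cands = nd_candidates(sigs_1d, num_slots=2)
print(len(cands))        # 16 = 4 * 4
print(cands[1]["in_0"])  # ('S(0)', 'S(1)')
```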

6 Collaboration between modules

The implementation of SbpSignature derivation uses a lot of functional-style code, presumably for information hiding between modules, logic reuse between parent classes and subclasses, and information transfer: much of the information is encapsulated into function objects and then retrieved and converted when needed.

The following figure shows some of the relationships between the different modules:

[Figure: collaboration between modules in SBP Signature derivation]

References

  • oneflow v0.9.1 (https://github.com/Oneflow-Inc/oneflow/tree/0ea44f45b360cd21f455c7b5fa8303269f7867f8/oneflow)

  • SBP Signature (https://docs.oneflow.org/master/parallelism/02_sbp.html#sbp-signature)

  • 2D SBP (https://docs.oneflow.org/master/parallelism/04_2d-sbp.html)

  • placement api (https://oneflow.readthedocs.io/en/master/tensor_attributes.html?highlight=placement#oneflow-placement)

  • https://segmentfault.com/a/1190000042625900


Origin blog.csdn.net/OneFlow_Official/article/details/128979546