OneFlow source code analysis: basic computing interface Primitive


Author | Jianhua Zheng

Section 5 of an earlier OneFlow release blog post described the framework's "multi-device adaptation"; the relevant passage is excerpted below:

OneFlow provides a simple, efficient, and easily extensible hardware abstraction layer, EP (Execution Provider), to handle the complexity of adapting to different hardware. With the hardware abstraction layer in place, users do not need to care about the implementation details of the underlying hardware or of the framework, and each module of the framework can be adapted to new hardware devices without modification. To complete a hardware adaptation, one only needs to implement a series of interfaces, following the hardware abstraction interface contract and the actual capabilities of the device.

EP also defines a set of basic computing interfaces, Primitive, and the Kernels have been re-implemented on top of them. Compared with the runtime interfaces provided by EP, the Primitive interfaces are more flexible and independent of one another; each interface represents a specific computing capability that a hardware device can provide.

The ep module consists of two main parts. One is the device management discussed previously: given the information provided by the user, a device instance can be obtained, and the device is abstracted through interfaces such as Stream, Event, and memory management.

The other part is the basic computing interface Primitive. This article only gives a brief introduction to the concept of Primitive and what it includes; the design and implementation of the individual computations are not covered.

1. What are Primitives?

Roughly speaking, the basic computing interfaces are the twenty or so interface classes defined in the primitive directory, all of which are subclasses of Primitive. These interface classes usually declare only a Launch method; which computations are actually supported is determined by each device-specific implementation.
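As a sketch of this pattern (names simplified and hypothetical; the real interfaces live under oneflow/core/ep/include/primitive/, and their Launch signatures also take a Stream* and other arguments), a Primitive subclass declares Launch, and a device-specific implementation fills it in:

```cpp
#include <cstddef>

// Hypothetical sketch of the Primitive pattern: an interface subclass
// declares only Launch; device implementations decide what is supported.
class Primitive {
 public:
  Primitive() = default;
  virtual ~Primitive() = default;
};

// One basic computing interface: elementwise unary computation.
// (Simplified: the real Launch also takes a Stream*.)
class ElementwiseUnary : public Primitive {
 public:
  virtual void Launch(const void* src, void* dst, size_t count) = 0;
};

// A device-specific implementation for the CPU, hardcoded here to relu on float.
class CpuReluImpl : public ElementwiseUnary {
 public:
  void Launch(const void* src, void* dst, size_t count) override {
    const float* in = static_cast<const float*>(src);
    float* out = static_cast<float*>(dst);
    for (size_t i = 0; i < count; ++i) { out[i] = in[i] > 0.0f ? in[i] : 0.0f; }
  }
};
```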

Each basic computing interface is shown in the following table:

| Primitive interface type | Device implementation | Dispatched by | Notes |
| --- | --- | --- | --- |
| Add | CPU, CUDA, OneDnn | DataType | |
| BatchMatmul | BatchMatmulImpl | Whether to transpose | Forwards to BroadcastMatmul |
| BroadcastElementwiseBinary | CPU, CUDA, OneDnn | BinaryOp | Supports the BinaryOp operations |
| BroadcastElementwiseUnary | CPU, CUDA | UnaryOp | Supports the UnaryOp operations |
| BroadcastMatmul | BroadcastMatmulImpl | Whether to transpose | Both the CPU and CUDA implementations are based on the template class BroadcastMatmulImpl |
| Cast | CPU, CUDA | DataType | |
| ConstantPad | CPU, CUDA | DataType | |
| CopyNd | CPU, CUDA | DimSize | |
| ElementwiseUnary | CPU, CUDA | UnaryOp | Supports the UnaryOp operations |
| Fill | CPU, CUDA | DataType | |
| LogSoftmaxBackward | CPU, CUDA, OneDnn | DataType | Shares its implementation with SoftmaxBackward |
| LogSoftmax | CPU, CUDA, OneDnn | DataType | Shares its implementation with Softmax; the base class SoftmaxBase of SoftmaxImpl can be either Softmax or LogSoftmax |
| Matmul | MatmulImpl | Whether to transpose | Forwards to BatchMatmul |
| Memcpy | CPU, CUDA | Copy direction | Host2Device, Device2Host, ... |
| Memset | CPU, CUDA | | |
| Permute | CPU, CUDA, OneDnn | DimSize | |
| SoftmaxBackward | CPU, CUDA, OneDnn | | Shares its implementation with LogSoftmaxBackward |
| Softmax | CPU, CUDA, OneDnn | | Shares its implementation with LogSoftmax |
| TensorFill | CPU, CUDA | DataType | |

2. Description of some computing interfaces

2.1 ElementwiseUnary

2.1.1 Execution process of relu kernel

The relu kernel performs its computation through ElementwiseUnary. Registering the relu kernel's SetCreateFn executes code similar to the following; primitive_factory_func_ is saved when the UnaryPrimitiveKernel is constructed.

```cpp
auto primitive_factory_func_ = [](user_op::KernelComputeContext* ctx) {
  const user_op::TensorDesc* src = ctx->TensorDesc4ArgNameAndIndex("x", 0);
  const user_op::TensorDesc* dst = ctx->TensorDesc4ArgNameAndIndex("y", 0);
  return ep::primitive::NewPrimitive<ep::primitive::ElementwiseUnaryFactory>(
      ctx->device_type(), ep::primitive::UnaryOp::kRelu, src->data_type(),
      dst->data_type());
};
OpKernel* ptr = new UnaryPrimitiveKernel("y", "x", primitive_factory_func_);
```

When UnaryPrimitiveKernel::Compute is called to execute the kernel computation, it performs the following steps:

  • Call primitive_factory_func_ to obtain a primitive instance.

    • NewPrimitive

      • calls NewObjUniquePtr to obtain an ElementwiseUnaryFactoryImpl instance (CPU, CUDA);

      • calls ElementwiseUnaryFactoryImpl::New, which returns an ElementwiseUnaryImpl instance (CPU, CUDA).

  • Call primitive->Launch to perform the computation.
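The two steps above can be sketched with stand-in types (all names here are illustrative, not OneFlow's; the real factory returns an ep::primitive instance, and the real Launch takes stream and tensor arguments):

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

struct Context { /* stands in for user_op::KernelComputeContext */ };

// Stand-in for the primitive returned by the factory; Launch just bumps a counter.
struct FakePrimitive {
  int* counter;
  explicit FakePrimitive(int* c) : counter(c) {}
  void Launch() { ++*counter; }
};

class UnaryPrimitiveKernelSketch {
 public:
  using FactoryFn = std::function<std::unique_ptr<FakePrimitive>(Context*)>;
  explicit UnaryPrimitiveKernelSketch(FactoryFn fn)
      : primitive_factory_func_(std::move(fn)) {}

  void Compute(Context* ctx) {
    // Step 1: build the primitive for this device/dtype combination.
    std::unique_ptr<FakePrimitive> primitive = primitive_factory_func_(ctx);
    assert(primitive);
    // Step 2: run the computation.
    primitive->Launch();
  }

 private:
  FactoryFn primitive_factory_func_;  // saved at construction time
};
```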

The relationship between the above classes is as follows:

(Figure: relationships among the classes above)

2.1.2 What operations does ElementwiseUnary support?

After the macro in ElementwiseUnaryFactoryImpl::New is expanded, the code looks like the following. It looks up the New function by the UnaryOp operation category and data types, passes the corresponding template parameters to the New function, and creates an ElementwiseUnaryImpl instance.

All <operation, data type> combinations supported by ElementwiseUnary in the CPU environment are registered in this map. This is the "Primitive interface" in the conventional sense (which operations, data types, and so on are supported), while the input parameters of an operation are determined by the signature of the Launch function.

```cpp
static const std::map<
    std::tuple<UnaryOp, DataType, DataType>,
    std::function<std::unique_ptr<ElementwiseUnary>(Scalar, Scalar)>>
    new_elementwise_unary_handle{
        {std::make_tuple(UnaryOp::kRelu, DataType::kFloat, DataType::kFloat),
         NewElementwiseUnary<UnaryOp::kRelu, float, float>},
        {std::make_tuple(UnaryOp::kRelu, DataType::kDouble, DataType::kDouble),
         NewElementwiseUnary<UnaryOp::kRelu, double, double>},
        {std::make_tuple(UnaryOp::kElu, DataType::kFloat, DataType::kFloat),
         NewElementwiseUnary<UnaryOp::kElu, float, float>},
        {std::make_tuple(UnaryOp::kLogicalNot, DataType::kDouble, DataType::kBool),
         NewElementwiseUnary<UnaryOp::kLogicalNot, double, bool>},
        // ......
    };
const auto it =
    new_elementwise_unary_handle.find(std::make_tuple(unary_op, src_type, dst_dtype));
if (it != new_elementwise_unary_handle.end()) {
  return it->second(attr0, attr1);
} else {
  return nullptr;
}
```
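The dispatch pattern itself can be modeled in a few lines. The following self-contained sketch uses stand-in enums and a trivial Unary struct (not OneFlow's types) to show how the tuple-keyed map selects a template instantiation and returns nullptr for unsupported combinations:

```cpp
#include <functional>
#include <map>
#include <memory>
#include <tuple>

// Stand-in enums and product type; the real code dispatches to
// ElementwiseUnaryImpl instantiations instead.
enum class UnaryOp { kRelu, kLogicalNot };
enum class DataType { kFloat, kDouble, kBool };

struct Unary { UnaryOp op; DataType src; DataType dst; };

template<UnaryOp op, DataType src, DataType dst>
std::unique_ptr<Unary> NewUnary() {
  return std::unique_ptr<Unary>(new Unary{op, src, dst});
}

std::unique_ptr<Unary> New(UnaryOp op, DataType src, DataType dst) {
  // The key encodes <operation, source type, destination type>; the value
  // is a factory for the matching template instantiation.
  static const std::map<std::tuple<UnaryOp, DataType, DataType>,
                        std::function<std::unique_ptr<Unary>()>>
      handle{
          {std::make_tuple(UnaryOp::kRelu, DataType::kFloat, DataType::kFloat),
           NewUnary<UnaryOp::kRelu, DataType::kFloat, DataType::kFloat>},
          {std::make_tuple(UnaryOp::kLogicalNot, DataType::kDouble, DataType::kBool),
           NewUnary<UnaryOp::kLogicalNot, DataType::kDouble, DataType::kBool>},
      };
  auto it = handle.find(std::make_tuple(op, src, dst));
  // Unsupported combinations yield nullptr, mirroring the real New function.
  return it == handle.end() ? nullptr : it->second();
}
```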

2.1.3 Implementation of ElementwiseUnaryImpl::Launch

Different subclasses of Primitive implement Launch differently and take different input parameters. ElementwiseUnaryImpl::Launch implements the computing logic (CPU, CUDA) through primitive::UnaryFunctor.

primitive::UnaryFunctor is a template class whose specializations are spread across the following files:

  • the device-independent UnaryFunctor implementations shared by all devices, which include the relu implementation;

  • the CPU UnaryFunctor implementations, which gain parallel acceleration through cpu_stream->ParallelFor;

  • the CUDA UnaryFunctor implementations, which then invoke device computation through cuda::elementwise::Unary.
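The specialization layout can be illustrated with a stripped-down stand-in (simplified and hypothetical parameters; the real UnaryFunctor is parameterized over device, op, and destination/source types in the same spirit):

```cpp
// Stand-in enums; not OneFlow's definitions.
enum class DeviceType { kCPU, kCUDA };
enum class UnaryOp { kRelu, kRsqrt };

// Primary template left undefined: unsupported combinations fail to compile.
template<DeviceType device, UnaryOp op, typename Dst, typename Src>
struct UnaryFunctor;

// A "common" relu implementation written once and shared by all devices,
// mirroring the device-independent specializations mentioned above.
template<DeviceType device, typename Dst, typename Src>
struct UnaryFunctor<device, UnaryOp::kRelu, Dst, Src> {
  Dst operator()(Src src) const {
    const Src zero = static_cast<Src>(0);
    return static_cast<Dst>(src > zero ? src : zero);
  }
};
```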

2.2 BroadcastElementwiseBinary

BroadcastElementwiseBinary also defines a CUDA factory implementation. All operation combinations supported on CUDA are defined in the map inside the New function, and each entry refers to a specialized instance of the NewBroadcastElementwiseBinary template function. These template specializations are defined in the following files:

  • broadcast_elementwise_binary_activation_grad.cu

  • broadcast_elementwise_binary_comparision.cu

  • broadcast_elementwise_binary_logical.cu

  • broadcast_elementwise_binary_math.cu

The macros in these files can be expanded with the following command; WITH_CUDA must be defined for the macros to expand properly.

```shell
nvcc -DWITH_CUDA \
    -E -std=c++17 \
    -I. -Ibuild \
    -Ibuild/oneflow/ir/llvm_monorepo-src/llvm/include \
    -Ibuild/oneflow/ir/llvm_monorepo-build/include \
    -Ibuild/half/src/half/include \
    -Ibuild/_deps/glog-src/src -Ibuild/_deps/glog-build \
    -Ibuild/protobuf/src/protobuf/src \
    oneflow/core/ep/cuda/primitive/broadcast_elementwise_binary_math.cu > math.cpp
```

3. The relationship between UserOp, Kernel and Primitive

3.1 UserOp and Kernel have a one-to-many relationship

In the code seen so far, a UserOp usually has only one Kernel; the Kernel does not distinguish between devices, and adapts to different device computations through Primitive. But there are exceptions.

The conv kernels show that CPU and CUDA each register a kernel with the same name. A closer look reveals that the value type of UserOpRegistryMgr::op_kernel_reg_result_ is a vector, so a UserOp has a one-to-many relationship with Kernels. Matching kernels are filtered via OpKernelRegistryResult::is_matched_hob.

Taking max_pool_2d as an example, its Kernel registration code is as follows:

```cpp
REGISTER_USER_KERNEL("max_pool_2d")
    .SetCreateFn<MaxPool2dKernel<device, dtype>>()
    .SetIsMatchedHob((user_op::HobDeviceType() == device)
                     && (user_op::HobDataType("x", 0) == GetDataType<dtype>::value));
```

In the preparation phase of the kernel computation, the relevant calls in StatefulOpKernel::ChooseOpKernel are as follows:

  • kernel_reg_val = UserOpRegistryMgr::Get().GetOpKernelRegistryResult(...)

    • determines whether a kernel matches via reg_val.is_matched_hob->get(ctx);

    • reports an error if no kernel matches, and logs a warning if more than one matches.

  • kernel = kernel_reg_val->create_fn()
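A toy model of this lookup, with illustrative names rather than OneFlow's, shows how a vector of registrations for one op name is filtered by a match predicate:

```cpp
#include <vector>

enum class DeviceType { kCPU, kCUDA };

struct KernelRegContext { DeviceType device; };

// Stand-in for OpKernelRegistryResult; IsMatched plays the role of
// is_matched_hob->get(ctx).
struct KernelRegistryResult {
  DeviceType device;
  bool IsMatched(const KernelRegContext& ctx) const { return ctx.device == device; }
};

const KernelRegistryResult* GetOpKernelRegistryResult(
    const std::vector<KernelRegistryResult>& regs, const KernelRegContext& ctx) {
  const KernelRegistryResult* found = nullptr;
  for (const auto& reg : regs) {
    if (reg.IsMatched(ctx)) {
      // Simplification: the real code errors on zero matches and only warns
      // on more than one; here an ambiguous match just yields nullptr.
      if (found != nullptr) { return nullptr; }
      found = &reg;
    }
  }
  return found;
}
```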

3.2 What exactly is IsMatchedHob?

is_matched_hob is of type IsMatchedHob:

```cpp
using IsMatchedHob = std::shared_ptr<hob::BaseExpr<user_op::KernelRegContext, bool>>;
```

(user_op::HobDeviceType() == device) && (user_op::HobDataType("x", 0) == GetDataType<dtype>::value) is not an ordinary bool expression, but a higher-order expression like the one in the following figure:

(Figure: structure of the higher-order bool expression)

HobDeviceType() returns a Custom object, which is a subclass of Expr whose ValueT is DeviceType. The DEFINE_BINARY_FUNCTOR macro defines a function that overloads Expr's == operator: the first parameter is an Expr (here, Custom) and the second is Custom::ValueT, i.e. DeviceType. It returns a BoolFunctor, which inherits from BoolExpr, itself also a subclass of Expr. Overloading of the && operator is likewise defined through macros. Together these form the higher-order bool expression shown above. BoolFunctor::get computes the value of the expression dynamically at runtime according to the context; normalization, for example, uses this to distinguish training from inference.
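A minimal expression-template sketch of this mechanism (with simplified stand-ins for Custom, BoolFunctor, and the context type; none of these are OneFlow's actual definitions) looks like this:

```cpp
// Stand-in for KernelRegContext.
struct Ctx { int device; int dtype; };

// Leaf expressions, playing the role of hob::Custom: get() reads a value
// from the context at runtime.
struct HobDeviceType { int get(const Ctx& c) const { return c.device; } };
struct HobDataType  { int get(const Ctx& c) const { return c.dtype; } };

// Comparison node, playing the role of BoolFunctor for operator==.
template<typename L>
struct EqExpr {
  L lhs;
  int rhs;
  bool get(const Ctx& c) const { return lhs.get(c) == rhs; }
};

// Conjunction node, playing the role of BoolFunctor for operator&&.
template<typename A, typename B>
struct AndExpr {
  A a;
  B b;
  bool get(const Ctx& c) const { return a.get(c) && b.get(c); }
};

// The overloads build the expression tree instead of evaluating eagerly.
inline EqExpr<HobDeviceType> operator==(HobDeviceType l, int r) { return {l, r}; }
inline EqExpr<HobDataType>  operator==(HobDataType l, int r)  { return {l, r}; }

template<typename A, typename B>
AndExpr<EqExpr<A>, EqExpr<B>> operator&&(EqExpr<A> a, EqExpr<B> b) { return {a, b}; }
```

Building `(HobDeviceType{} == 1) && (HobDataType{} == 7)` produces a tree whose get() is only evaluated when a context is supplied, mirroring how is_matched_hob->get(ctx) works.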

The various types of relationships are as follows:

(Figure: relationships among these types)

3.2.1 Destructors for boolean expressions

BaseExpr is the base class of the bool expression objects mentioned above, and its destructor is not virtual. The code of SetIsMatchedHob is shown below. The concrete type T is determined at the call site, and make_shared records how to destroy that concrete type, so this scenario does not cause memory leaks.

```cpp
template<typename T>
OpKernelRegistry& SetIsMatchedHob(const T& hob) {
  result_.is_matched_hob = std::make_shared<T>(hob);
  return *this;
}
```


Try OneFlow: github.com/Oneflow-Inc/oneflow/



Origin blog.csdn.net/OneFlow_Official/article/details/131016530