How to write a deep learning compiler


A compiler is essentially a tool for improving development efficiency: it converts a high-level language into a low-level language (usually binary machine code), so that programmers do not have to write binary code by hand. During the conversion, the primary task is to guarantee correctness, while optimizations are applied to improve the efficiency of the resulting program. The input of a compiler in the traditional sense is usually some high-level language, and the output is an executable program. In my actual work I have been involved in the development of deep learning compilers, whose design ideas are very similar to those of traditional compilers. This article therefore uses deep learning compiler development, together with MegCC, the deep learning compiler we actually developed, as an example to illustrate how to write a compiler. The article is divided into two parts:

  1. An introduction to deep learning compilers, focusing on common front-end and back-end optimization methods.

  2. An introduction to how to develop a deep learning compiler, taking MegCC as an example.

Introduction to Deep Learning Compilers

Different from a traditional compiler, the input of a deep learning compiler is a neural network model, and the output is an executable program that expresses the computation of that model and can run on different platforms. Nevertheless, a deep learning compiler is structured like a traditional compiler: it is divided into a front end and a back end, where the front end performs hardware-independent optimizations and the back end performs hardware-dependent optimizations. The two most important concepts in a compiler are the IR (intermediate representation) and the Pass. For humans, abstraction is an important way to understand complex things, and the IR is the abstraction of the intermediate products of the compilation process. An IR usually has multiple levels: the higher-level IR is more abstract, and the lower-level IR is closer to the hardware. A Pass defines how to gradually lower the high-level IR to the low-level IR, and is also where optimizations are performed. The optimization methods below are grouped by front end and back end.
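To make the idea of multi-level IR and lowering Passes concrete, here is a minimal sketch (toy Python data structures, far simpler than MLIR and not MegCC code) of a Pass that rewrites a high-level fused op into the primitive ops of a lower-level IR while preserving the dataflow:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Op:
    name: str            # e.g. "FusedMulAdd", "Mul", "Add"
    inputs: List[str]    # SSA value names
    output: str

@dataclass
class Graph:
    ops: List[Op] = field(default_factory=list)

def lower_fused_mul_add(graph: Graph) -> Graph:
    """A toy lowering Pass: rewrite the high-level FusedMulAdd op into the
    primitive Mul and Add ops of a lower-level IR, keeping the dataflow intact."""
    lowered = Graph()
    tmp_id = 0
    for op in graph.ops:
        if op.name == "FusedMulAdd":                       # high-level IR node
            a, b, c = op.inputs
            tmp = f"%tmp{tmp_id}"
            tmp_id += 1
            lowered.ops.append(Op("Mul", [a, b], tmp))     # lower-level IR nodes
            lowered.ops.append(Op("Add", [tmp, c], op.output))
        else:
            lowered.ops.append(op)
    return lowered

g = Graph([Op("FusedMulAdd", ["%x", "%w", "%bias"], "%y")])
for op in lower_fused_mul_add(g).ops:
    print(op)

A real compiler chains many such Passes, each one lowering the IR a little further or optimizing it along the way.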

Front-end optimization methods

The front end first builds a computation graph from the input model to generate the high-level IR, and then performs a series of optimizations on it. Since these optimizations operate on the computation graph and do not involve concrete computation, they are independent of the back end. Common optimization methods can be divided into three categories: node-level optimizations, block-level optimizations, and dataflow-level optimizations.

  1. node-level optimizations. Node-level optimization mainly eliminates unnecessary nodes and replaces some nodes with cheaper ones. For example, if a matrix A is added to a matrix of zeros, the addition can be eliminated.

  2. block-level optimizations. Optimization at the block level mainly includes algebraic simplification and operator fusion.

    a. Algebraic simplification: for example, the matrix product of A^T and B^T can be computed as the product of B and A followed by a transpose, since A^T B^T = (BA)^T, which saves one transpose operation.

    b. Operator fusion is a common deep learning optimization. Although fusion does not reduce the amount of computation, it reduces memory traffic and improves the compute-to-memory-access ratio, thereby improving performance (a toy sketch of these graph rewrites follows this list).

  3. dataflow-level optimizations. Optimization at the data flow level mainly includes static memory planning, etc.

    a. Static memory planning reduces peak memory usage by reusing buffers whose lifetimes do not overlap (a simplified planner is sketched later, alongside StaticMemoryPlanningPass).
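To make the node-level and block-level rewrites above concrete, here is a minimal, self-contained sketch on a toy graph representation (not MegCC's IR): it removes an addition with a known zero tensor and then fuses a multiply followed by an add into a single fused op, much like the FUSE_MUL_ADD3 fusion that appears later in this article.

# Toy computation graph: a list of (op, inputs, output) triples in SSA form.
# This is only an illustration, not MegCC's IR.
graph = [
    ("mul", ("x", "two"), "t0"),
    ("add", ("t0", "zero"), "t1"),   # adding a known zero tensor: removable
    ("add", ("t1", "bias"), "y"),
]

def eliminate_add_zero(graph, zero_names=("zero",)):
    """Node-level rewrite: drop add nodes whose second input is a known
    zero tensor, forwarding the other input to later uses."""
    out, alias = [], {}
    for op, ins, dst in graph:
        ins = tuple(alias.get(i, i) for i in ins)
        if op == "add" and ins[1] in zero_names:
            alias[dst] = ins[0]              # the result is just the first input
        else:
            out.append((op, ins, dst))
    return out

def fuse_mul_add(graph):
    """Block-level rewrite: fuse a mul directly followed by an add into one
    fused op (a real pass would also check that the intermediate result
    has no other users)."""
    out, i = [], 0
    while i < len(graph):
        if (i + 1 < len(graph)
                and graph[i][0] == "mul" and graph[i + 1][0] == "add"
                and graph[i + 1][1][0] == graph[i][2]):
            (_, (a, b), t), (_, (_, c), dst) = graph[i], graph[i + 1]
            out.append(("fused_mul_add", (a, b, c), dst))
            i += 2
        else:
            out.append(graph[i])
            i += 1
    return out

print(fuse_mul_add(eliminate_add_zero(graph)))
# -> [('fused_mul_add', ('x', 'two', 'bias'), 'y')]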

Back-end optimization methods

Common back-end optimizations include loop unrolling, loop fusion, hiding memory access latency, and so on. In addition, depending on the hardware, targeted optimizations can be applied, such as mapping to hardware instructions, vectorization, and hand-written assembly kernels that exploit hardware parallelism. Figure 1 shows the commonly used back-end optimization methods [1].


Figure 1 Commonly used optimization methods in the backend
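As a small illustration of one of these back-end techniques, the following sketch generates a C kernel whose inner loop is unrolled by a fixed factor (the axpy kernel and the generator are assumptions for illustration, not MegCC code; a real back end would also vectorize, tile, and schedule for the target ISA):

def emit_unrolled_axpy(unroll=4):
    """Emit C source for y[i] += alpha * x[i] with the inner loop unrolled
    by `unroll`. Purely illustrative of back-end loop unrolling."""
    body = "\n".join(
        f"        y[i + {k}] += alpha * x[i + {k}];" for k in range(unroll))
    return f"""
void axpy(float alpha, const float *x, float *y, int n) {{
    int i = 0;
    for (; i + {unroll} <= n; i += {unroll}) {{
{body}
    }}
    for (; i < n; ++i)    /* scalar tail loop for the leftover elements */
        y[i] += alpha * x[i];
}}
"""

print(emit_unrolled_axpy(4))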

MegCC

Next, we take MegCC as an example to briefly introduce the implementation of a deep learning compiler based on MLIR. The key points are how to define a series of IRs according to the requirements, and how to define Passes that lower the high-level IR to the low-level IR while performing the optimizations described above.

Introduction to MegCC

The principle behind MegCC is as follows: when a deep learning model runs inference, each Operator corresponds to a compute kernel that performs its calculation, so inference over the whole model amounts to executing the kernels of all its Operators once; when they have all run, the inference result is obtained. A traditional deep learning inference framework does the following things at runtime:

  • Computation graph optimization ----- mainly determined by the model.

  • Kernel selection ----- choose a suitable Kernel for each Operator of the model according to its parameters.

  • Memory allocation ----- the amount of memory to allocate is determined by the model and by the Kernel each Operator executes.

  • Executing the Kernel of each Operator ----- strongly related to the inference input data.

Among the work listed above, graph optimization, Kernel selection, and memory allocation depend only on the trained model, not on the input data at inference time, so they can be done when the model is compiled; at runtime, inference only requires executing the Kernel of each Operator. MegCC therefore performs graph optimization, Kernel selection, and memory allocation in its compilation phase, and leaves only the Operators' Kernel computation to the Runtime (a minimal sketch of such a runtime loop follows the list below). This brings the following advantages:

  • The Runtime is very lightweight, an order of magnitude smaller than traditional inference frameworks, because it only contains the Kernels the model actually needs; unrelated Kernels are not compiled in.

  • Better performance, because the runtime only performs Kernel computation, so unnecessary overhead is avoided.

  • Better Kernel performance, because each Kernel is generated specifically for its Operator, so deeper optimization can be performed according to the Operator's parameters.

  • It solves the operator long-tail problem after Operator fusion: for example, there is no limit on the type or number of activations fused after a conv, so more fusion patterns can be supported without the Runtime size growing significantly.

  • In addition, MegCC's Runtime is implemented in pure C, which makes it easy to port to embedded chips.
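Below is a minimal sketch, written in Python rather than MegCC's pure-C Runtime, of what is left for the runtime once graph optimization, Kernel selection, and memory planning have happened at compile time: it simply walks a precompiled list of kernel calls over pre-planned buffers (the kernels, buffer names, and shapes here are hypothetical):

import numpy as np

# "Kernels" selected at compile time; in MegCC these would be generated C code.
def scale_kernel(buffers, src, dst, alpha):
    np.multiply(buffers[src], alpha, out=buffers[dst])

def relu_kernel(buffers, src, dst):
    np.maximum(buffers[src], 0.0, out=buffers[dst])

# Produced "at compile time": a memory plan (tensor name -> shape) and an
# ordered list of kernel calls. Names and shapes here are hypothetical.
memory_plan = {"in": (2, 2), "t0": (2, 2), "out": (2, 2)}
program = [
    (scale_kernel, dict(src="in", dst="t0", alpha=2.0)),
    (relu_kernel, dict(src="t0", dst="out")),
]

def run(program, memory_plan, input_data):
    buffers = {name: np.empty(shape, np.float32)
               for name, shape in memory_plan.items()}
    buffers["in"][...] = input_data
    for kernel, kwargs in program:     # the entire runtime "loop"
        kernel(buffers, **kwargs)
    return buffers["out"]

print(run(program, memory_plan, [[-1.0, 2.0], [3.0, -4.0]]))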

MegCC mainly consists of two parts: a compiler part and a runtime part. The compiler part, which handles compilation, is introduced below.

MegCC compiler

The main process of Compiler is:

  1. Rely on MegEngine (our open source deep learning framework) for model import and static graph optimization (block-level optimizations, operator fusion, etc.).

  2. Convert the optimized model into a custom MGB IR based on MLIR.

  3. Lower MGB IR to Kernel IR, passing through Abstract Kernel IR, via a series of Passes.

  4. Export Kernel IR as runtime model and runtime kernel for the runtime part of MegCC.


Figure 2 MegCC compiler process

IR in MegCC

MegCC defines a series of IRs based on MLIR. Defining an IR in MLIR requires the user to define a Dialect (see the official MLIR documentation for details), which is then converted into a C++ representation by TableGen when the program is compiled.

  • MGB IR: defined to correspond one-to-one with the Operators in MegEngine; it is the entry IR through which MegCC imports a model into the MLIR system. It contains the type of each Opr and the parameters of that Opr; every input and output variable is a Tensor and is in static single assignment (SSA) form. See MegCC MGB IR on GitHub for details.

  • Abstract Kernel IR: the IR of the abstract Kernel layer, obtained mainly by converting the MGB IR above. Its inputs and outputs have been lowered to Buffers, so it is no longer in SSA form. In addition, the attributes of each Opr are converted from MegEngine enumeration values into strings. See MegCC Abstract Kernel IR on GitHub for details.

  • Kernel IR: the IR form after Kernels have been generated. It no longer has the concept of an Opr; the whole computation graph is linked together through the corresponding Kernels, and the Opr parameters are baked into the generated Kernels. See MegCC Kernel IR on GitHub for details.

The main Pass in MegCC

  1. MGBToKernelPass: this Pass converts MGB IR to Abstract Kernel IR. During the conversion it mainly does the following:

    · Convert all input and output Tensor types in MGB IR to Buffer types.

    · Converts all enumerated parameters in MGB IR into the corresponding strings, so that Abstract Kernel IR is completely decoupled from MegEngine.

    · Converts some memory-movement Oprs, such as Concat and SetSubtensor, into Relayout (node-level optimizations).

    · Judges whether each Opr has a static or dynamic shape. A dynamic shape means that a tensor's shape can only be determined from the input values at runtime, for example outputting all the elements of a tensor that are greater than 1. Static-shape Oprs are converted directly to Abstract Kernel IR, while dynamic-shape Oprs are converted directly to Kernel IR Instructions.

  2. MGBFuseKernelPass: applied to MGB IR, it uses MLIR-based pattern matching to perform as much kernel fusion as possible, such as merging two consecutive typecvts into one typecvt (block-level optimizations, operator fusion).

  3. MemoryForwardingPass: traverses all Oprs in Abstract Kernel IR that might not need actual computation and could directly share their input memory; if an Opr indeed needs no computation, it is turned into a direct memory forward (dataflow-level optimizations).

  4. KernelMaterializationPass: attaches real Kernel code to all of Abstract Kernel IR, converting ops into KernelCalls and adding the corresponding KernelDefs. A KernelCall and its KernelDef are matched by symbol.

  5. StaticMemoryPlanningPass: performs memory planning for all static-shape memrefs. The memory planning algorithm is an improved version of MegEngine's memory planning algorithm, the PushDown algorithm, which greatly compresses runtime memory usage. At the same time, it replaces MLIR's memref.Alloc with Kernel IR's MemPlan, which mainly records the whole memref of the planned memory block and the offset of the Tensor within it (dataflow-level optimizations, static memory planning; a simplified planner is sketched after this list).
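The sketch below illustrates static memory planning with a simplified greedy planner (an illustration only, not MegEngine's PushDown algorithm): tensors whose lifetimes do not overlap may share the same offset inside one global buffer.

def plan_memory(tensors):
    """Greedy static memory planner. `tensors` maps a tensor name to
    (size_in_bytes, first_use_step, last_use_step); returns the byte offset
    of each tensor in one shared buffer plus the total buffer size."""
    placed = []     # (offset, size, first, last) of already planned tensors
    offsets = {}
    # Place larger tensors first to reduce fragmentation.
    for name, (size, first, last) in sorted(
            tensors.items(), key=lambda kv: -kv[1][0]):
        offset = 0
        while True:
            clash = [p for p in placed
                     if not (last < p[2] or first > p[3])                       # lifetimes overlap
                     and not (offset + size <= p[0] or offset >= p[0] + p[1])]  # space overlaps
            if not clash:
                break
            offset = max(p[0] + p[1] for p in clash)   # push past the conflicting tensors
        placed.append((offset, size, first, last))
        offsets[name] = offset
    total = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, total

# Hypothetical intermediate tensors of a small model: (bytes, first use, last use).
tensors = {
    "conv_out": (1024, 0, 1),
    "relu_out": (1024, 1, 2),
    "fc_out":   (256, 2, 3),
}
print(plan_memory(tensors))
# -> ({'conv_out': 0, 'relu_out': 1024, 'fc_out': 0}, 2048)

In this example, fc_out reuses conv_out's offset because their lifetimes do not overlap, so the plan needs 2048 bytes instead of the 2304 bytes a naive allocation would use.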

The Passes above complete the model's graph optimization, memory planning, and Kernel generation; the back-end optimizations mentioned earlier happen in the Kernel generation stage. Currently MegCC mainly uses manually optimized Kernel templates. Finally, the compiled model can be dumped in the model format defined by the Runtime, and the Kernel files needed to run the model are generated. Let's take a simple model as an example, use MegCC's auxiliary tools mgb-importer and megcc-opt (available in the Release package), and observe how the IR changes after each Pass. You can also use the mgb-to-tinynn tool to complete the whole model compilation directly; see the MegCC getting-started documentation for details.

1. Dump the model (using MegEngine)

import megengine.functional as F
import megengine.module as M
import megengine.optimizer as optim
from megengine import jit
import megengine
import numpy as np


class MulAddNet(M.Module):
    def __init__(self):
        super().__init__()

    def forward(self, input):
        x = input * 2.
        x = x + 1.5
        return x


model = MulAddNet()
model.eval()


@jit.trace(symbolic=True, capture_as_const=True)
def infer_func(data, *, model):
    pred = model(data)
    return pred


data = megengine.Tensor([[1., 2.], [3., 4.]])
output = infer_func(data, model=model)
print(output)


infer_func.dump("MulAdd.mge", arg_names=["data"])

2. Import the model to MGB IR

./bin/mgb-importer MulAdd.mge mulAdd.mlir
cat mulAdd.mlir
output:


module {
  "MGB.ParamStorage"() {sym_name = "const<2>[2]", sym_visibility = "private", type = tensor<1xf32>, user_count = 1 : i32, value = dense<2.000000e+00> : tensor<1xf32>} : () -> ()
  "MGB.ParamStorage"() {sym_name = "const<1.5>[4]", sym_visibility = "private", type = tensor<1xf32>, user_count = 1 : i32, value = dense<1.500000e+00> : tensor<1xf32>} : () -> ()
  func @mulAdd(%arg0: tensor<2x2xf32> {mgb.func_arg_name = "data"}) -> (tensor<2x2xf32> {mgb.func_result_name = "FUSE_MUL_ADD3(const<2>[2],data,const<1.5>[4])[14]"}) {
    %0 = "MGB.Reshape"(%arg0) {axis = 7 : i32} : (tensor<2x2xf32>) -> tensor<2x2xf32>
    %1 = "MGB.ParamProvider"() {name = @"const<1.5>[4]"} : () -> tensor<1xf32>
    %2 = "MGB.ParamProvider"() {name = @"const<2>[2]"} : () -> tensor<1xf32>
    %3 = "MGB.Elemwise"(%2, %0, %1) {mode = 35 : i32} : (tensor<1xf32>, tensor<2x2xf32>, tensor<1xf32>) -> tensor<2x2xf32>
    return %3 : tensor<2x2xf32>
  }
}

It can be seen that in the importer process, the multiplication and addition operations are fused into "FUSE_MUL_ADD3".

3. MGBToKernelPass, MemoryForwardingPass, and StaticMemoryPlanningPass

./bin/megcc-opt --MGB-to-Kernel --memory-forwarding --static-memory-planning mulAdd.mlir > mulAdd_final.mlir
cat mulAdd_final.mlir
output:


#map = affine_map<(d0, d1) -> (d0 * 2 + d1)>
module {
  "Kernel.WeightStorage"() {sym_name = "const<2>[2]", type = tensor<1xf32>, user_count = 1 : i32, value = dense<2.000000e+00> : tensor<1xf32>} : () -> ()
  "Kernel.WeightStorage"() {sym_name = "const<1.5>[4]", type = tensor<1xf32>, user_count = 1 : i32, value = dense<1.500000e+00> : tensor<1xf32>} : () -> ()
  func @mulAdd(%arg0: memref<2x2xf32> {mgb.func_arg_name = "data"}, %arg1: memref<16xi8> {mgb.func_arg_name = "kGlobalBuffer"}) -> (memref<2x2xf32, #map> {mgb.func_result_name = "FUSE_MUL_ADD3(const<2>[2],data,const<1.5>[4])[14]"}) {
    %0 = "Kernel.Reshape"(%arg0) {axis = 7 : i32, determined = true} : (memref<2x2xf32>) -> memref<2x2xf32, #map>
    %1 = "Kernel.GetWeight"() {name = @"const<1.5>[4]"} : () -> memref<1xf32>
    %2 = "Kernel.GetWeight"() {name = @"const<2>[2]"} : () -> memref<1xf32>
    %3 = "Kernel.MemPlan"(%arg1) : (memref<16xi8>) -> memref<2x2xf32, #map>
    "Kernel.FUSE_MUL_ADD3"(%2, %0, %1, %3) : (memref<1xf32>, memref<2x2xf32, #map>, memref<1xf32>, memref<2x2xf32, #map>) -> ()
    return %3 : memref<2x2xf32, #map>
  }
}

After the above Passes, MGB IR has been converted to Kernel IR and memory planning has been performed. If you are interested, you can take a finer-grained look at what each Pass does and use megcc-opt's options to control which Passes are run.

Kernel generation

The MegCC compiler generates a corresponding Kernel for each Operator in the model to carry out its computation. At present, most Kernels in MegCC are manually optimized, pre-written Kernel templates, and these templates generate concrete Kernels according to the specific Operator parameters. Manually optimized Kernels still dominate because, without parameter search, the Kernels generated via MLIR on CPU are still far slower than hand-written ones; in the long run, however, automatically generating Kernels is the preferable approach.
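To illustrate the template idea, here is a hedged sketch (not MegCC's actual template engine; the kernel name, parameters, and activation table are made up) of a Python template specialized at compile time by an Operator's parameters, in this case the activation fused after a bias add:

# Possible activations fused after the bias add; the table and names are
# made up for illustration.
ACTIVATIONS = {
    "IDENTITY": "v",
    "RELU": "v > 0.f ? v : 0.f",
    "H_SWISH": "v * fminf(fmaxf(v + 3.f, 0.f), 6.f) / 6.f",
}

def gen_bias_act_kernel(sym, activation="RELU"):
    """Generate C source for a bias-add kernel with the chosen activation
    fused in, specialized by the Operator's parameters."""
    act_expr = ACTIVATIONS[activation]
    return f"""
#include <math.h>
void {sym}(const float *src, const float *bias, float *dst, int n) {{
    for (int i = 0; i < n; ++i) {{
        float v = src[i] + bias[i];
        dst[i] = {act_expr};
    }}
}}
"""

print(gen_bias_act_kernel("kernel_bias_relu_f32", "RELU"))

Because a Kernel is generated per Operator instance, only the variants a model actually uses end up in the Runtime, which is also how the long-tail problem after fusion is kept under control.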

MegCC is now open source; the repository address is:

https://github.com/MegEngine/MegCC

You are welcome to try it out, star the repository, and file issues.

Attached:

For more information about MegEngine, you can view the documentation and the GitHub project, or join the MegEngine user QQ group: 1029741705. You are welcome to contribute to the MegEngine community, become an Awesome MegEngineer, and enjoy endless certificates of honor and customized gifts.

reference

1. Mingzhen Li, Yi Liu, et al. The Deep Learning Compiler: A Comprehensive Survey. 2020.

