Summary of Deep Learning Compilers

The development of deep learning has had a profound impact on many fields of science. It has not only shown significant value in artificial intelligence areas such as natural language processing (NLP) and computer vision (CV), but has also achieved great success in broader applications such as e-commerce, smart cities, and drug discovery. With the emergence of diverse deep learning models such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and generative adversarial networks (GANs), simplifying the programming of these DL models is critical to realizing their widespread adoption.

With continuous efforts from industry and academia, several popular DL frameworks have been proposed, such as TensorFlow, PyTorch, MXNet, and CNTK, to simplify the implementation of various DL models. Although these frameworks each have advantages and disadvantages depending on their design trade-offs, interoperability becomes important to reduce redundant engineering effort when supporting emerging DL models across existing DL frameworks. To provide this interoperability, ONNX was proposed; it defines a unified format for representing DL models, facilitating model conversion between different DL frameworks.

To address the shortcomings of DL libraries and tools, and to ease the burden of manually optimizing DL models for each DL hardware target, the DL community has been promoting the development of domain-specific compilers. Several popular DL compilers have been proposed by industry and academia, such as TVM, Tensor Comprehensions, Glow, nGraph, and XLA. A DL compiler takes a model definition described in a DL framework as input and generates an efficient code implementation for various DL hardware as output. The transformation from model definition to concrete code implementation is highly optimized for both the model specification and the hardware architecture. Specifically, DL compilers incorporate DL-oriented optimizations such as layer and operator fusion (e.g., Conv+BatchNorm), which makes the generated code more efficient. In addition, existing DL compilers also leverage mature toolchains from general-purpose compilers such as LLVM, which provides better portability across diverse hardware architectures. Similar to traditional compilers, DL compilers adopt a layered design of front end, intermediate representation (IR), and back end. However, the uniqueness of DL compilers lies in the design of multi-level IRs and DL-specific optimizations.
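
As an illustration of operator fusion, the Conv+BatchNorm pattern can be folded offline: at inference time BatchNorm applies a per-channel affine transform, so its parameters can be absorbed into the convolution's weights and bias. The sketch below is a minimal NumPy illustration of the folding arithmetic, not code from any particular compiler; all names are invented for the example.

```python
import numpy as np

def fold_batchnorm_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold inference-time BatchNorm parameters into the preceding convolution.

    W: conv weights of shape (out_channels, in_channels, kH, kW)
    b: conv bias of shape (out_channels,)
    gamma, beta, mean, var: per-channel BatchNorm parameters.
    Returns (W_fused, b_fused) such that conv(x, W_fused, b_fused)
    equals batchnorm(conv(x, W, b)) at inference time.
    """
    scale = gamma / np.sqrt(var + eps)           # per-output-channel scale
    W_fused = W * scale[:, None, None, None]     # rescale each output filter
    b_fused = (b - mean) * scale + beta          # fold the shift into the bias
    return W_fused, b_fused
```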

  • Deep Learning Frameworks

Figure 1. Deep learning frameworks

Figure 1 shows the landscape of DL frameworks, including currently popular frameworks, historical frameworks, and frameworks supported by ONNX.

TensorFlow: Among all DL frameworks, TensorFlow has the most comprehensive support for language interfaces, including C++, Python, Java, Go, R, and Haskell. To reduce the complexity of using TensorFlow, Google adopted Keras as the front end of the TensorFlow core.

Keras: A high-level neural network library for quickly building DL models, written in pure Python. Although Keras itself is not a DL framework, it provides a high-level API that integrates with TensorFlow, MXNet, Theano, and CNTK. With Keras, DL developers can build a neural network in just a few lines of code. However, Keras is not flexible enough due to over-encapsulation, which makes it difficult to add operators or access low-level data information.
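
As a small sketch of how compact a Keras model definition can be, the following example builds and compiles a tiny classifier with the Sequential API; the layer sizes and input shape are arbitrary choices for illustration.

```python
# A minimal sketch: a two-layer classifier for flattened 28x28 images.
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```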

PyTorch: Facebook released PyTorch by rewriting the Lua-based DL framework Torch in Python and refactoring all modules at the tensor level. As the most popular dynamic framework, PyTorch embeds primitives in Python for building dynamic dataflow graphs, where control flow is executed in the Python interpreter. PyTorch 1.0 merged the code bases of PyTorch 0.4 and Caffe2 to create a unified framework, enabling PyTorch to absorb the advantages of Caffe2 for efficient graph execution and mobile deployment.
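
To illustrate the dynamic-graph style, the toy module below (invented for this example, not taken from the PyTorch documentation) uses ordinary Python control flow inside forward, so the executed graph can differ from one input to the next:

```python
import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 16)

    def forward(self, x):
        # Ordinary Python control flow drives graph construction, so the
        # number of layers applied depends on the input itself.
        steps = int(x.abs().sum().item()) % 3 + 1
        for _ in range(steps):
            x = torch.relu(self.linear(x))
        return x

out = DynamicNet()(torch.randn(4, 16))
print(out.shape)  # torch.Size([4, 16])
```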

Caffe/Caffe2: Caffe was designed at UC Berkeley for deep learning and image classification. Caffe provides command-line, Python, and MATLAB APIs. Its simplicity makes the source code easy to extend and well suited to in-depth analysis by developers; as a result, Caffe is mainly oriented towards research, which has kept it popular since its release. Caffe2 is built on the original Caffe project and is similar to TensorFlow in code structure, although it has a lighter API and makes it easier to access intermediate results in the computation graph.

MXNet: MXNet supports multiple language APIs, including Python, C++, R, Julia, MATLAB, and JavaScript. It is designed for scalability, with particular attention to reducing data-loading and I/O complexity. MXNet offers both paradigms: declarative programming like Caffe and TensorFlow, and imperative programming like PyTorch.

CNTK: CNTK is available through Python and C++ APIs, or through its own scripting language (BrainScript). It is designed to be easy to use and production-ready for large-scale data. CNTK uses a static computational graph similar to TensorFlow and Caffe, where a DL model is viewed as a series of computational steps via a directed graph.

PaddlePaddle: The original design is similar to Caffe, where each model can be represented as a set of layers. However, PaddlePaddle v2 adopts TensorFlow's operator concept, decomposing layers into finer-grained operators and thereby supporting more complex DL models. PaddlePaddle Fluid is similar to PyTorch in that it provides its own interpreter, thus avoiding the limited performance of the Python interpreter.

ONNX: Open Neural Network Exchange (ONNX) defines a scalable computation-graph model, so computational graphs built by different DL frameworks can easily be converted to ONNX, which makes converting models between DL frameworks easier. For example, it allows developers to build an MXNet model and then run it with PyTorch for inference. As shown in Figure 1, ONNX has been integrated into PyTorch, MXNet, PaddlePaddle, and others. For some DL frameworks that are not yet directly supported (such as TensorFlow and Keras), ONNX provides converters for them.
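
As a small sketch of this interchange (the toy model and file name are placeholders invented for the example), a PyTorch model can be exported to ONNX and then validated or consumed by any ONNX-compatible runtime:

```python
import torch
import torch.nn as nn
import onnx

# Export a toy PyTorch model to the ONNX format.
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU())
dummy_input = torch.randn(1, 8)
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["output"])

# The exported graph can be loaded and checked with the onnx package,
# or handed to any ONNX-compatible backend for inference.
onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)
```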

Deprecated frameworks: Due to the rapid development of the DL community, many historical DL frameworks are no longer active. For example, PyTorch has replaced Torch. As one of the oldest DL frameworks, Theano is no longer maintained. Chainer was once the framework of choice for dynamic computational graphs, but it has been replaced by MXNet, PyTorch, and TensorFlow, which offer similar features.

  • Deep Learning Compiler Architecture Design

The general design architecture of a DL compiler consists mainly of two parts, the front end and the back end, as shown in Figure 2. The intermediate representation (IR) spans both. In general, an IR is an abstraction of a program used for program optimization. Specifically, the DL model is transformed into multi-level IRs in a DL compiler, where the high-level IR resides in the front end and the low-level IR resides in the back end. Based on the high-level IR, the compiler front end is responsible for hardware-independent transformations and optimizations. Based on the low-level IR, the compiler back end is responsible for hardware-specific optimizations, code generation, and compilation.

Figure 2. Overview of the generally adopted DL compiler design architecture

High-level IR: Also known as graph IR, it represents the computation and control flow and is independent of hardware. The design challenge of the high-level IR is the abstraction capability for computation and control flow, which must capture and represent diverse DL models. The goal of the high-level IR is to establish control flow and the dependencies between operators and data, and to provide an interface for graph-level optimization. It also contains rich semantic information for compilation and provides extensibility for custom operators.
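
As a concrete, hedged example of what a graph IR looks like, TVM's Relay IR can express a small conv+relu graph as follows; the shapes are arbitrary, and the exact API may differ across TVM versions:

```python
from tvm import relay

# Build a tiny hardware-independent graph: conv2d followed by relu.
x = relay.var("x", shape=(1, 3, 224, 224), dtype="float32")
w = relay.var("w", shape=(16, 3, 3, 3), dtype="float32")
y = relay.nn.relu(relay.nn.conv2d(x, w, padding=(1, 1)))
func = relay.Function([x, w], y)
print(func)  # prints the textual form of the graph IR
```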

Low-level IR: Designed for hardware-specific optimization and code generation for different hardware targets. Therefore, the low-level IR should be fine-grained enough to reflect hardware characteristics and represent hardware-specific optimizations. It should also allow the use of mature third-party toolchains, such as Halide, polyhedral tools, and LLVM, in the compiler back end.

The front end takes a DL model from an existing DL framework as input and converts it into a computation-graph representation (i.e., the graph IR). The front end needs to implement various format conversions to support the different formats of different frameworks. Computation-graph optimization combines general-purpose compiler optimization techniques with DL-specific ones to reduce redundancy in the computation graph and improve its efficiency. These optimizations can be divided into node-level (e.g., nop elimination and zero-dimensional-tensor elimination), block-level (e.g., algebraic simplification, operator fusion, and operator sinking), and dataflow-level (e.g., CSE, DCE, static memory planning, and layout transformation) optimizations. After the front end, an optimized computation graph is generated and passed to the back end.
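
To make the flavor of these graph-level rewrites concrete, the toy sketch below (invented for illustration, not taken from any real compiler) represents a graph as nested tuples and applies one algebraic simplification, x * 1 -> x:

```python
# Toy expression nodes: ("var", name), ("const", value), or (op, lhs, rhs).

def simplify(node):
    """Recursively apply the algebraic rewrite x * 1 -> x."""
    if node[0] in ("var", "const"):
        return node
    op, lhs, rhs = node[0], simplify(node[1]), simplify(node[2])
    if op == "mul" and rhs == ("const", 1):
        return lhs
    if op == "mul" and lhs == ("const", 1):
        return rhs
    return (op, lhs, rhs)

expr = ("add", ("mul", ("var", "x"), ("const", 1)), ("var", "y"))
print(simplify(expr))  # ('add', ('var', 'x'), ('var', 'y'))
```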

The back end converts the high-level IR to low-level IR and performs hardware-specific optimizations. On the one hand, it can directly lower the high-level IR to third-party toolchain IRs (such as LLVM IR) to leverage existing infrastructure for general-purpose optimization and code generation. On the other hand, it can leverage prior knowledge of both the DL model and the hardware characteristics to generate code more efficiently through a customized compilation pipeline. Common hardware-specific optimizations include hardware intrinsic mapping, memory allocation and fetching, memory latency hiding, parallelization, and loop-oriented optimizations. To determine the optimal parameter settings in the large optimization space, existing DL compilers generally adopt two approaches: auto-scheduling (e.g., polyhedral methods) and auto-tuning (e.g., AutoTVM).
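
As a hedged sketch of loop-oriented optimization in the back end, the snippet below uses TVM's tensor-expression API to describe a vector addition and then applies split, parallelize, and vectorize scheduling primitives before generating code via LLVM. The exact API (e.g., te.create_schedule) varies across TVM versions, so this is illustrative rather than canonical:

```python
import tvm
from tvm import te

# Describe the computation: elementwise vector addition of length 1024.
n = 1024
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

# Schedule it: split the loop, run the outer loop in parallel,
# and vectorize the inner loop -- typical loop-oriented optimizations.
s = te.create_schedule(C.op)
outer, inner = s[C].split(C.op.axis[0], factor=64)
s[C].parallel(outer)
s[C].vectorize(inner)

# Inspect the lowered low-level IR, then generate code with the LLVM backend.
print(tvm.lower(s, [A, B, C], simple_mode=True))
mod = tvm.build(s, [A, B, C], target="llvm")
```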
