Introduction to Code Generation Technology in MaxCompute

Preface

   In "Introduction to Code Generation Technology in Database Systems", we briefly introduced Code Generation technology and its importance in large-scale OLAP systems, especially distributed ones. MaxCompute uses Code Generation to improve computing efficiency. In MaxCompute 2.0, we introduced JIT (Just-In-Time) Code Generation based on LLVM. Combined with the vectorized execution engine and SIMD-based optimization of execution efficiency, this gives MaxCompute 2.0 substantially better performance than MaxCompute 1.0. For details, please refer to "MaxCompute 2.0 Performance Evaluation: More Powerful and Efficient".

Code Generation in MaxCompute 1.0


   As shown in the figure above, MaxCompute 1.0 uses static code generation, and the work is done mainly by a role named "Executor" in the MaxCompute control cluster. The process is as follows:

  1. The user's SQL statement is parsed and optimized on the Executor, which produces the corresponding query plan.
  2. The Code Generation module on the Executor translates the query plan into a C++ source file named "mapred.cpp". As shown in the figure above, each Task in the query plan (that is, a Stage in the MaxCompute job) is translated into a C++ class, and all of its processing logic is generated into that class's Process() method.
  3. The Executor calls g++ to compile "mapred.cpp" into a dynamic library and distributes it to each Worker in the computing cluster.
  4. Each Worker that receives a task loads the dynamic library and calls the corresponding Process() method to carry out the computation.
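
   Step 2 can be illustrated with a minimal sketch of static code generation: a tiny expression tree is turned into the C++ source text of a class with a Process() method. The names Expr, EmitExpr, EmitTaskClass, and the Row type in the emitted text are assumptions made for illustration only; the actual contents of "mapred.cpp" are not shown in this article.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical expression node of a query plan: a column reference,
// an integer literal, or a binary operation.
struct Expr {
    enum class Kind { Column, Literal, Binary };
    Kind kind;
    std::string text;                // column name, literal value, or operator
    std::unique_ptr<Expr> lhs, rhs;  // set only for Binary nodes
};
using ExprPtr = std::unique_ptr<Expr>;

ExprPtr Leaf(Expr::Kind kind, std::string text) {
    return ExprPtr(new Expr{kind, std::move(text), nullptr, nullptr});
}

ExprPtr Binary(std::string op, ExprPtr lhs, ExprPtr rhs) {
    return ExprPtr(new Expr{Expr::Kind::Binary, std::move(op),
                            std::move(lhs), std::move(rhs)});
}

// Emit the C++ text for one expression; columns become fields of `row`.
std::string EmitExpr(const Expr& e) {
    switch (e.kind) {
        case Expr::Kind::Column:
            return "row." + e.text;
        case Expr::Kind::Literal:
            return e.text;
        case Expr::Kind::Binary:
            assert(e.lhs && e.rhs);
            return "(" + EmitExpr(*e.lhs) + " " + e.text + " " +
                   EmitExpr(*e.rhs) + ")";
    }
    return "";  // not reached for valid kinds
}

// Emit one class per Task, with the whole expression inlined into Process().
std::string EmitTaskClass(const std::string& task_name, const Expr& projection) {
    return "class " + task_name + " {\n"
           "public:\n"
           "    long Process(const Row& row) { return " +
           EmitExpr(projection) + "; }\n"
           "};\n";
}
```

   In this sketch the returned string would be written to a source file and handed to the compiler; the expression is fully inlined, so the compiled Process() contains no per-node dispatch.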

   With Code Generation, the code is customized for each SQL statement at execution time, so execution is more efficient than with the traditional Volcano Model. However, this approach also has some problems.
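
   The efficiency difference can be sketched by evaluating the same expression, (c0 * c1) + 10, in two ways: through a tree of virtual Eval() calls, as an interpreting engine would, and through one specialized function of the kind a code generator could emit. All type and function names here are hypothetical and are not taken from MaxCompute.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

using Row = std::vector<long>;

// Interpreted evaluation: every node costs one virtual call per row.
struct Node {
    virtual ~Node() = default;
    virtual long Eval(const Row& row) const = 0;
};

struct ColumnRef : Node {
    std::size_t index;
    explicit ColumnRef(std::size_t i) : index(i) {}
    long Eval(const Row& row) const override {
        assert(index < row.size());
        return row[index];
    }
};

struct Constant : Node {
    long value;
    explicit Constant(long v) : value(v) {}
    long Eval(const Row&) const override { return value; }
};

struct BinaryOp : Node {
    char op;  // '+' or '*'
    std::unique_ptr<Node> lhs, rhs;
    BinaryOp(char o, std::unique_ptr<Node> l, std::unique_ptr<Node> r)
        : op(o), lhs(std::move(l)), rhs(std::move(r)) {}
    long Eval(const Row& row) const override {
        long a = lhs->Eval(row);
        long b = rhs->Eval(row);
        return op == '+' ? a + b : a * b;
    }
};

// Builds the interpreted tree for (c0 * c1) + 10.
std::unique_ptr<Node> BuildInterpretedPlan() {
    std::unique_ptr<Node> product(new BinaryOp(
        '*', std::unique_ptr<Node>(new ColumnRef(0)),
        std::unique_ptr<Node>(new ColumnRef(1))));
    return std::unique_ptr<Node>(new BinaryOp(
        '+', std::move(product), std::unique_ptr<Node>(new Constant(10))));
}

// What generated code for the same expression could look like:
// no tree walk and no virtual calls, just straight-line arithmetic.
long GeneratedEval(const Row& row) { return row[0] * row[1] + 10; }
```

   Both paths compute the same value; the generated version simply removes the per-node dispatch that the tree walk pays on every row.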

  1. Compiling with g++ consumes considerable CPU and memory, especially at optimization level O2 or higher. When the user's SQL is complex (some statements have thousands of expressions in the SELECT list, or deeply nested expressions), the generated C++ source file is also large and compilation takes even longer. In production, we have seen compilations take tens of seconds and consume gigabytes of memory.
  2. Transferring the generated dynamic library from the control cluster to the computing cluster adds network overhead. Because the dynamic library is tied to the logic of a specific SQL statement, it cannot be reused, so every SQL statement goes through compilation and distribution. When tasks are submitted frequently, this puts the stability of the control cluster at risk.
  3. Because of the high compile-time overhead, this Code Generation method is costly for complex statements and for small and medium-sized data queries, such as in service mode.

Code Generation in MaxCompute 2.0


   MaxCompute 2.0 uses JIT Code Generation based on LLVM. JIT means that the program generates the machine instructions it needs dynamically at runtime. As a result, the entire Code Generation work moves from the control cluster to the Workers in the computing cluster that actually execute the computing logic. The process is as follows:

  1. As in MaxCompute 1.0, the user's SQL statement is parsed and optimized on the Executor to generate the corresponding query plan.
  2. The query plan is sent directly to each Worker in the computing cluster.
  3. The Code Generation module of the MaxCompute 2.0 execution engine loads the query plan and uses the LLVM C++ API to generate the corresponding machine code. The Code Generation module returns a function pointer as the entry point for the call.
  4. The Worker carries out the computation by calling the function pointer returned by the Code Generation module.
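
   The calling convention in steps 3 and 4 can be sketched without LLVM. In the stand-in below, a hypothetical Compile() plays the role of the Code Generation module and returns a plain function pointer, which the Worker then calls for each batch. This is not JIT compilation: Compile() only selects between two precompiled kernels, whereas a real JIT would emit machine code for the plan and could bake the threshold into it. QueryPlan, EntryFn, Compile, and RunBatch are all assumed names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, heavily simplified query plan: keep the values that
// compare against a threshold in the given way.
struct QueryPlan {
    enum class Compare { Greater, Less };
    Compare compare;
};

// Entry-point signature the Worker calls for each batch.
// Returns the number of values written to `out`.
using EntryFn = std::size_t (*)(const long* in, std::size_t n,
                                long threshold, long* out);

std::size_t FilterGreater(const long* in, std::size_t n, long threshold,
                          long* out) {
    std::size_t kept = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (in[i] > threshold) out[kept++] = in[i];
    }
    return kept;
}

std::size_t FilterLess(const long* in, std::size_t n, long threshold,
                       long* out) {
    std::size_t kept = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (in[i] < threshold) out[kept++] = in[i];
    }
    return kept;
}

// Stand-in for the Code Generation module: maps a plan to an entry point.
EntryFn Compile(const QueryPlan& plan) {
    return plan.compare == QueryPlan::Compare::Greater ? &FilterGreater
                                                       : &FilterLess;
}

// Worker side: no knowledge of the plan is needed once the entry point exists.
std::size_t RunBatch(EntryFn entry, const long* in, std::size_t n,
                     long threshold, long* out) {
    assert(entry != nullptr);
    return entry(in, n, threshold, out);
}
```

   The point of the sketch is the interface: the Worker depends only on the entry-point signature, so the code behind the pointer can be regenerated for every query without changing the engine binary.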

   Compared with MaxCompute 1.0, code generation in MaxCompute 2.0 is significantly faster: the average code generation time for a SQL statement is about 2-3 s in 1.0 and drops to 100-200 ms in 2.0. In 2.0, Code Generation is done entirely on the Workers of the computing cluster, which reduces the pressure on the control cluster and helps keep it stable. In addition, because the MaxCompute 2.0 execution engine is reused across queries (it does not differ from one SQL statement to another), there is no need to transfer dynamic libraries between the control cluster and the computing cluster as in 1.0, which reduces the network load between the two clusters.

Follow-up

   Currently, the execution engine of MaxCompute 2.0 is still based on the Volcano Model; it passes data between operators in batches and uses a columnar layout to speed up execution. JIT Code Generation based on LLVM is currently used mainly for expression evaluation, Streamline, and other hot spots. Next, we plan to try full-stage Code Generation, similar to HyPer (http://www.hyper-db.com/). Interested readers can take a look at this paper: http://www.vldb.org/pvldb/vol4/p539-neumann.pdf . The attached PDF combines "Introduction to Code Generation Technology in Database Systems" with part of this article and can be used as a reference.
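
   A Volcano Model that passes batches instead of single rows can be sketched as follows. Each operator still exposes a pull-style Next() call, but one call moves a whole column of values, so the virtual-call cost is paid once per batch rather than once per row. The Batch, Operator, Scan, FilterGreaterThan, and Drain names are assumptions for illustration and do not reflect MaxCompute's actual operator interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// One column of values; a real engine would carry several columns per batch.
struct Batch {
    std::vector<long> values;
};

// Pull-based operator interface, as in the Volcano Model, but per batch.
struct Operator {
    virtual ~Operator() = default;
    // Fills `out` and returns true, or returns false when input is exhausted.
    virtual bool Next(Batch& out) = 0;
};

class Scan : public Operator {
public:
    Scan(std::vector<long> data, std::size_t batch_size)
        : data_(std::move(data)), batch_size_(batch_size) {
        assert(batch_size_ > 0);
    }
    bool Next(Batch& out) override {
        if (pos_ >= data_.size()) return false;
        std::size_t end = std::min(pos_ + batch_size_, data_.size());
        out.values.assign(data_.begin() + static_cast<std::ptrdiff_t>(pos_),
                          data_.begin() + static_cast<std::ptrdiff_t>(end));
        pos_ = end;
        return true;
    }
private:
    std::vector<long> data_;
    std::size_t batch_size_;
    std::size_t pos_ = 0;
};

class FilterGreaterThan : public Operator {
public:
    FilterGreaterThan(std::unique_ptr<Operator> child, long threshold)
        : child_(std::move(child)), threshold_(threshold) {}
    bool Next(Batch& out) override {
        Batch in;
        if (!child_->Next(in)) return false;
        out.values.clear();
        // Tight loop over a contiguous column: one virtual call per batch.
        for (long v : in.values) {
            if (v > threshold_) out.values.push_back(v);
        }
        return true;  // the batch may be empty if nothing passed the filter
    }
private:
    std::unique_ptr<Operator> child_;
    long threshold_;
};

// Pulls batches from the root operator until it is exhausted.
std::vector<long> Drain(Operator& root) {
    std::vector<long> result;
    Batch batch;
    while (root.Next(batch)) {
        result.insert(result.end(), batch.values.begin(), batch.values.end());
    }
    return result;
}
```

   The inner loop of FilterGreaterThan is the kind of hot spot where JIT-generated, expression-specific code would replace the fixed comparison.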
