A powerful tool for mining data value! Alibaba Cloud's real-time data warehouse AnalyticDB PG

Purpose

With the arrival of the digital economy, more and more applications rely on data analysis to extract value from data. As an important foundation for big data storage and online analysis systems, analytical database (OLAP) technology provides an important platform for realizing the value of data online.

After investigation, the Alibaba OLAP team found that the OLAP execution engines of existing databases are often secondary developments on top of an existing OLTP execution engine. As a result they suffer from performance loss and a heavy historical burden, cannot adopt the latest optimization techniques, and do not take full advantage of new hardware.

As data volumes grow rapidly and demand for data analysis strengthens, the amount of computation an OLAP system must perform also grows exponentially, and the performance of existing systems cannot meet future online analysis needs. After analysis, the Alibaba OLAP team believes that an OLAP execution engine truly built for the cloud era and the digital economy should have the following technical characteristics:

•  Support for a variety of hardware platforms. To meet the needs of enterprise cloud deployments, the engine should support multiple hardware platforms: in addition to the traditional x86 platform it should be compatible with ARM platforms, and it should support acceleration on emerging hardware such as GPUs and FPGAs.

•  Ultimate cost-performance. The engine should make full use of hardware performance, choose the best algorithms and operator implementations, and improve hardware efficiency. For dense, complex computation, it should use specialized algorithms and specialized hardware to improve efficiency.

•  High SQL compatibility. The engine should be highly compatible with the existing SQL standard, optimizer, and storage formats, reducing migration effort and the learning curve. Users should not need to rewrite SQL, relearn optimization techniques, or migrate data; a minor software upgrade should be enough to benefit from the latest performance optimizations.
01.png

To meet these requirements, the R&D team for Alibaba's OLAP product AnalyticDB Postgres edition (ADB PG for short) spent more than a year developing a new compute engine, Odyssey. The Odyssey compute engine has the following main technical features:

•  Support for a variety of hardware and software platforms. In addition to traditional x86 servers, it supports ARM servers, and it can use GPUs and FPGAs to accelerate parts of the core algorithms.

•  Strong performance. It abandons the execution engine implementation of traditional databases and makes full use of hardware efficiency. Its algorithm design eliminates the performance problems caused by the volcano model, fragmented memory allocation, and redundant logic, so that valuable CPU resources go to core computation. It uses LLVM for dynamic code generation (CodeGen) to speed up expression evaluation and streamline computation logic, slimming that logic down as far as possible. It applies optimization techniques from supercomputing: modeling operator performance, thoroughly analyzing code hot spots, and rigorously checking the performance of every line of code. It also exploits new hardware features, namely CPU SIMD instructions, the high bandwidth of GPUs, and the deep pipelines of FPGAs, to improve performance further.

•  Compatibility. The Odyssey compute engine was born inside ADB PG and is compatible with PostgreSQL at multiple levels: the SQL standard, the optimizer, and the storage layer. Users can move from the native engine to the Odyssey compute engine without changing anything.

In the rest of this article we describe the new Odyssey compute engine from three angles: architecture design, core technical points, and performance evaluation.

Introduction to the Odyssey compute engine

The basic framework

The figure below shows the system architecture of ADB PG and how the Odyssey compute engine relates to it.
02.png

ADB PG is a highly available, scalable, distributed analytical database system with an MPP architecture. An ADB PG cluster consists of a Master node and Segment nodes interconnected by a network. The Master node receives user requests, parses and optimizes the SQL, and dispatches compute tasks. Segment nodes store data and execute compute tasks, and there may be multiple data exchange (shuffle) operations between Segment nodes. Network speed, storage performance, and Segment execution performance all have a major impact on ADB PG performance. The Odyssey compute engine in this release mainly improves Segment execution performance.

The development goal of the Odyssey compute engine is to give ADB PG a new, more efficient engine while leaving the original modules unchanged. After the ADB PG Master node receives a user's SQL request, it parses and optimizes the SQL and sends the execution plan to the Segments. Odyssey does not affect this stage, so users do not need to modify or re-optimize their SQL.

When a Segment receives the execution plan from the Master node and Odyssey is enabled, it calls the Odyssey compute engine instead of the native execution engine. At the storage level, Odyssey is fully compatible with ADB PG's native Heap, AOCS, and AORO tables and reuses the native storage structures and storage access interfaces; Odyssey itself focuses on the execution layer. In principle, Odyssey's optimizations are orthogonal to the optimizer-level and storage-level optimizations of the native ADB PG engine, so they can be combined seamlessly.

03.png

The figure above shows the execution flow of the Odyssey engine. Odyssey plugs into ADB PG through hooks and becomes a new engine that is transparent to the user. When ADB PG starts executing a SQL statement, the function installed in the hook determines which execution engine is used. Users can choose either the native execution engine or the Odyssey execution engine by setting a GUC parameter. In the next version, the engine will be chosen automatically using a cost model and rules, so that each engine can play to its strengths.

To ensure system stability and functional completeness, the Odyssey compute engine supports execution engine fallback: when a feature is not supported or execution fails, it automatically falls back to the native execution engine. Because the execution engine itself is stateless, users can switch engines at any granularity (database level, session level, or SQL statement level) as needed.
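
As a rough illustration of the hook-and-fallback flow described above, the sketch below uses a function pointer as the hook and a boolean flag standing in for the GUC parameter. Every name in it (`Plan`, `executor_hook`, `use_odyssey`, `odyssey_execute`, `native_execute`, `run_plan`) is hypothetical and chosen for this example; none of them are ADB PG's actual symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical plan descriptor: only the field this sketch needs. */
typedef struct Plan {
    bool uses_unsupported_feature;  /* stands in for "Odyssey cannot run this plan" */
} Plan;

typedef enum { ENGINE_NATIVE, ENGINE_ODYSSEY } Engine;

/* Hook type: an engine returns true on success, false if it cannot run the plan. */
typedef bool (*executor_hook_type)(const Plan *plan);

/* Stand-in for the native engine: always able to execute. */
static bool native_execute(const Plan *plan) {
    (void)plan;
    return true;
}

/* Stand-in for Odyssey: declines plans it does not support. */
static bool odyssey_execute(const Plan *plan) {
    return !plan->uses_unsupported_feature;
}

/* Stand-in for the GUC switch; a real setting could be scoped per
 * database, session, or statement. */
static bool use_odyssey = true;
static executor_hook_type executor_hook = odyssey_execute;

/* Run a plan and report which engine actually produced the result. */
Engine run_plan(const Plan *plan) {
    if (use_odyssey && executor_hook != NULL && executor_hook(plan)) {
        return ENGINE_ODYSSEY;
    }
    /* Hook disabled, missing, or failed: fall back to the native engine. */
    bool ok = native_execute(plan);
    assert(ok);
    return ENGINE_NATIVE;
}
```

Because neither stand-in engine keeps state between calls, flipping `use_odyssey` between two calls to `run_plan` is safe, which mirrors the article's point that a stateless engine can be switched at any granularity.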

Design ideas

Based on the above architecture, the Odyssey compute engine makes full use of ADB PG's strengths, reusing its storage, optimizer, and transaction control, while Odyssey itself focuses on improving compute performance and complements the powerful, feature-rich native ADB PG. To improve compute performance, we analyzed the native execution engine, compared a variety of execution engines side by side, and designed a new compute engine. Odyssey redesigns three components: the Plan Node, Codegen, and the Executor. As the figure shows, each Plan Node corresponds to one Codegen unit and one Executor. The Codegen unit generates code (IR) from the Plan Node; the Executor selects a backend for the target hardware platform, turns the IR into an executable for that platform, requests resources, and runs it. The final result is returned to the client through the Master.

04.png

Execution model optimization. The Odyssey compute engine gives up the traditional volcano model in favor of a batch volcano model. Batch execution reduces function call overhead, makes it easy to use vectorized computation to process data more efficiently, and also makes development more flexible and leaves more room for optimization.

Just-in-time compilation. The Odyssey compute engine uses Just-in-Time (JIT) technology, generating code dynamically with LLVM. Applying JIT to some core operations removes the performance overhead caused by the many layers of abstraction in high-level languages, allows expression evaluation and complex logic to be optimized at the assembly level, and minimizes the number of instructions.

Memory management optimization. The native ADB PG execution engine suffers from fragmented memory allocation and fragmented memory copies. Odyssey redesigns the algorithm layer to allocate memory in blocks and reuse allocated memory as much as possible, greatly reducing the number of memory allocations.

Supercomputing-style software optimization. The development of Odyssey borrows many optimization techniques from supercomputing software development. In supercomputing software optimization, building a theoretical performance model, doing detailed performance analysis (profiling), and rigorously checking code performance are effective ways to improve software performance. Odyssey adopts these techniques across the board, so every operator and every piece of code is optimized in depth.

Multi-platform compatibility. In addition to the traditional x86 platform, Odyssey is compatible with other x86 platforms and with ARM platforms, and it supports different versions of the Linux operating system, giving users more choice and more room for optimization.

Hardware acceleration. Odyssey can exploit the high throughput of GPUs and the high-density compute power of FPGAs. Compute-intensive operators with high throughput requirements are accelerated on GPUs; operators whose computation logic is relatively fixed but complex are accelerated on FPGAs.

Core technical points

Execution model optimization

05.png

The figure above shows the execution plan for TPC-H Q3. In ADB PG, a SQL statement is converted into an execution plan tree. Traditional databases use the volcano model: a parent node drives the execution of its child nodes, each child returns data to its parent one row at a time, and the parent then continues processing. The volcano model has three core performance problems:
(1) Too many function calls: passing each row requires at least one function call, so CPU overhead is high.
(2) Difficult optimization: some logic spans multiple functions and cannot be optimized as a whole.
(3) Low CPU utilization: after an operator finishes one operation on a row, control immediately returns to the upper node to perform a different operation; the many logic jumps work against CPU efficiency.
The Odyssey compute engine completely rewrites the volcano model: it reuses the original tree structure and executes in batches to improve execution performance.

In the volcano model, the unit of data passed between nodes is one row: each time a node produces a row, the upper node pulls it and runs the next stage of computation. In the Odyssey compute engine, the unit of data transfer is a batch. A batch contains multiple rows (typically thousands) stored in a column-first format. A node fills a batch with multiple rows before returning it to the upper node, which eliminates the performance problems of the volcano model. First, this reduces the number of function calls by three orders of magnitude compared with the volcano model, removing the overhead of excessive function calls. Second, it reduces logic that spans functions, which helps algorithm-level optimization and makes it possible to choose the best algorithm for each operator. Finally, it reduces logic jumps, so the execution resources of a modern CPU are used effectively.
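
The difference in call counts can be made concrete with a small sketch. It is a toy model, not Odyssey's code: the `Scan` struct, the function names, and the batch size of 1024 are all assumptions made for illustration (the article only says a batch typically holds thousands of rows).

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_SIZE 1024  /* assumed batch size for this sketch */

/* A toy scan operator over an in-memory int column. */
typedef struct Scan {
    const int *column;
    size_t num_rows;
    size_t pos;
    size_t calls;  /* how many times the parent pulled from this node */
} Scan;

/* Row-at-a-time volcano interface: one call per row. Returns 0 at end of data. */
int scan_next_row(Scan *s, int *out) {
    s->calls++;
    if (s->pos >= s->num_rows) return 0;
    *out = s->column[s->pos++];
    return 1;
}

/* Batch interface: one call hands back up to BATCH_SIZE contiguous values.
 * Returns the number of rows in the batch, 0 at end of data. */
size_t scan_next_batch(Scan *s, const int **out) {
    assert(s->pos <= s->num_rows);
    s->calls++;
    size_t n = s->num_rows - s->pos;
    if (n > BATCH_SIZE) n = BATCH_SIZE;
    *out = s->column + s->pos;
    s->pos += n;
    return n;
}

/* Parent operator (SUM) pulling one row per call. */
long sum_by_row(Scan *s) {
    long total = 0;
    int v;
    while (scan_next_row(s, &v)) total += v;
    return total;
}

/* Parent operator (SUM) pulling one batch per call; the inner loop is a
 * tight loop over contiguous column values. */
long sum_by_batch(Scan *s) {
    long total = 0;
    const int *vals;
    size_t n;
    while ((n = scan_next_batch(s, &vals)) > 0) {
        for (size_t i = 0; i < n; i++) total += vals[i];
    }
    return total;
}
```

With 4096 rows and the assumed batch size of 1024, `sum_by_row` makes 4097 calls into the child (4096 rows plus the end-of-data call), while `sum_by_batch` makes 5. The inner loop of `sum_by_batch` is also the kind of simple loop over a column that a compiler or hand-written SIMD code can vectorize.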

Memory Management

The Odyssey compute engine continues to use ADB PG's memory management module. Besides offering improved performance, ADB PG's native memory management module automatically tracks memory allocations and releases memory automatically, which avoids memory leaks. The module has been proven in large-scale use on the cloud, so Odyssey keeps it, but optimizes how memory is used to reduce unnecessary and fragmented memory allocations.

On the one hand, Odyssey reuses allocated memory wherever possible to avoid unnecessary repeated allocations. In the volcano model, some nodes allocate memory for each new row they produce, so every row may require a memory allocation. Odyssey abandons the volcano model and passes data in batches. Once the upper node has consumed all the rows in a batch, Odyssey reuses the batch's memory instead of allocating again.
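// Native engine: fragmented allocation, one palloc per row
for (size_t i = 0; i < num_rows; i++) {
    void *hash_entry = palloc(32);
    ... // copy row i into hash_entry
}

// Odyssey engine: a single block allocation for all rows
char *hash_table = palloc(32 * 1000000);
for (size_t i = 0; i < num_rows; i++) {
    ... // copy row i into hash_table + 32 * i
}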

On the other hand, when a large number of small allocations are needed, Odyssey chooses a more efficient allocation scheme (as shown in the pseudocode above). For example, when Hash Join builds its hash table by inserting one row at a time, the native ADB PG execution engine allocates memory for each row before inserting it. Every row therefore needs its own allocation, which produces a large number of fragmented memory allocations. A hash table with 1M rows of 32 bytes each, for instance, results in 1M fragmented 32-byte allocations. On modern systems, fragmented memory allocation has a large performance impact: not only is the allocation itself time-consuming, the subsequent management, tracking, and release are time-consuming too.

The Odyssey team diagnosed the fragmented memory allocation problem in the native ADB PG execution engine, analyzed its causes, and identified the few key algorithms responsible. Through algorithm design, Odyssey avoids fragmented memory allocation, improving memory management performance without reducing memory usage efficiency.
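
The batch-reuse idea from earlier in this section can be sketched in the same spirit. The example below counts allocator calls for a producer that allocates per row versus one that refills a single batch buffer. It uses plain `malloc` rather than ADB PG's `palloc`, and the names `counted_alloc`, `produce_per_row`, and `produce_per_batch` as well as the 1000-row batch are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ROW_BYTES 32
#define ROWS_PER_BATCH 1000  /* assumed batch size for this sketch */

static size_t alloc_calls = 0;  /* allocator calls made through counted_alloc */

/* Thin malloc wrapper so the two strategies can be compared by call count.
 * Aborts on allocation failure to keep the sketch short. */
static void *counted_alloc(size_t bytes) {
    void *p = malloc(bytes);
    assert(p != NULL);
    alloc_calls++;
    return p;
}

/* Per-row strategy: a fresh allocation for every produced row.
 * Returns the number of allocator calls used. */
size_t produce_per_row(size_t num_rows) {
    alloc_calls = 0;
    for (size_t i = 0; i < num_rows; i++) {
        void *row = counted_alloc(ROW_BYTES);
        memset(row, 0, ROW_BYTES);  /* stands in for copying row data */
        free(row);                  /* the consumer is done with the row */
    }
    return alloc_calls;
}

/* Batch strategy: allocate one batch buffer and refill it for every batch.
 * Returns the number of allocator calls used. */
size_t produce_per_batch(size_t num_rows) {
    alloc_calls = 0;
    char *batch = counted_alloc((size_t)ROW_BYTES * ROWS_PER_BATCH);
    for (size_t done = 0; done < num_rows; done += ROWS_PER_BATCH) {
        size_t n = num_rows - done;
        if (n > ROWS_PER_BATCH) n = ROWS_PER_BATCH;
        memset(batch, 0, n * ROW_BYTES);  /* refill the same memory in place */
    }
    free(batch);
    return alloc_calls;
}
```

For 100,000 rows the per-row producer calls the allocator 100,000 times and the batch producer calls it once; the allocator's own bookkeeping, tracking, and release work shrinks in the same proportion.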

Codegen

The Odyssey compute engine uses LLVM-based codegen (code generation) to reduce virtual function calls and to speed up complex expressions and branching logic. Unlike the expression-level codegen in PostgreSQL 11, Odyssey applies codegen to entire operators. Codegen is currently used for three kinds of optimization: operator optimization, logical expression optimization, and code specialization.

Operator optimization. With codegen, an entire operator can be generated as a single function, reducing virtual function calls and redundant code. Multiple operators can also be generated as one function, which keeps those advantages and additionally reduces memory use and memory copies. Taking a join + group by scenario as an example, the join and the aggregation can be fused into one function. In particular, if the hash key and the group key are the same, only one hash table needs to be built.

Logical expression optimization. The code below illustrates this optimization. For a filter such as a > 10 and b < 5, ADB PG uses three function calls: Int32GT(a, 10) for a > 10, Int32LT(b, 5) for b < 5, and a third function for the AND. Three function calls have a significant performance impact. Odyssey instead uses LLVM to generate LLVM IR, which is compiled down to machine code, so the operation takes only three instructions. This avoids a great deal of redundant work and produces minimal low-level code for each expression and each logical condition.

// Example SQL
select count(*) from table where a > 10 and b < 5;
// ADB PG expression approach: multiple function calls
result = Int8AndOp(Int32GT(a, 10), Int32LT(b, 5));
// Odyssey approach: generate minimal low-level code (LLVM IR)
%res1 = icmp sgt i32 %a, 10
%res2 = icmp slt i32 %b, 5
%res = and i1 %res1, %res2
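
To see the same contrast in compilable form, the sketch below evaluates the filter two ways in plain C: once through function pointers, the way an expression interpreter would, and once as the straight-line code that generated IR amounts to. This is hand-written C standing in for LLVM output, not actual code generation, and the names `Predicate`, `count_interpreted`, and `count_specialized` are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Interpreted style: each operator is a separate function reached through a pointer. */
typedef bool (*cmp_fn)(int32_t, int32_t);

static bool int32_gt(int32_t x, int32_t y) { return x > y; }
static bool int32_lt(int32_t x, int32_t y) { return x < y; }
static bool bool_and(bool x, bool y) { return x && y; }

/* One conjunct of the filter: a column value compared against a constant. */
typedef struct Predicate {
    cmp_fn cmp;
    int32_t constant;
} Predicate;

/* Counts rows where pa holds on a[i] AND pb holds on b[i]: three calls per row. */
size_t count_interpreted(const int32_t *a, const int32_t *b, size_t n,
                         Predicate pa, Predicate pb) {
    assert(pa.cmp != NULL && pb.cmp != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (bool_and(pa.cmp(a[i], pa.constant), pb.cmp(b[i], pb.constant))) {
            count++;
        }
    }
    return count;
}

/* What code generated for "a > 10 AND b < 5" boils down to:
 * two signed compares and a bitwise AND, with no calls. */
size_t count_specialized(const int32_t *a, const int32_t *b, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count += (size_t)((a[i] > 10) & (b[i] < 5));
    }
    return count;
}
```

Both functions return the same count for the same input; the second simply has the operators and constants baked in, which is the information a code generator has once the SQL has been planned.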

Code specialization. When a condition's outcome is fully known before execution, LLVM code specialization can eliminate the check. For example, some data structure properties, such as each node's output data types and data lengths, are fixed once the SQL has been parsed and before the execution engine starts. Code written in a high-level language such as C must check the type and length of every element, because the developer does not know the data type when writing the code. When LLVM generates code, however, the data type is already known, so the generated code skips the check and operates on the data directly. Because the execution engine checks data types and lengths so often, the small saving each time adds up to a clear performance improvement.
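
A minimal sketch of the idea in plain C, assuming a column that is either 32-bit or 64-bit integers: the generic routine re-checks the type for every element, while the specialized routine is selected once, before the loop runs. Real code specialization would emit the typed loop through LLVM at run time; here a function pointer chosen up front plays that role, and all names (`ColType`, `sum_generic`, `specialize_sum`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TYPE_INT32, TYPE_INT64 } ColType;

/* Generic path: the element type is re-checked for every value. */
int64_t sum_generic(const void *data, size_t n, ColType type) {
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        switch (type) {  /* per-element type dispatch */
        case TYPE_INT32: total += ((const int32_t *)data)[i]; break;
        case TYPE_INT64: total += ((const int64_t *)data)[i]; break;
        }
    }
    return total;
}

/* Specialized paths: the type is fixed, so the loop body has no branch on it. */
static int64_t sum_int32(const void *data, size_t n) {
    const int32_t *v = data;
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) total += v[i];
    return total;
}

static int64_t sum_int64(const void *data, size_t n) {
    const int64_t *v = data;
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) total += v[i];
    return total;
}

typedef int64_t (*sum_fn)(const void *data, size_t n);

/* Resolve the type once, before execution starts, mirroring what a code
 * generator knows after the SQL has been parsed. */
sum_fn specialize_sum(ColType type) {
    assert(type == TYPE_INT32 || type == TYPE_INT64);
    return type == TYPE_INT32 ? sum_int32 : sum_int64;
}
```

A caller would invoke `specialize_sum` once per plan node and then run the returned function over every batch, so the type decision is paid once rather than once per value.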

Vectorized execution

Changing the volcano model from pulling rows to pulling batches, combined with the advantages of columnar storage, makes it possible to squeeze more performance out of the processor. For example, a batch can be processed with the CPU's vector instructions or with the many cores of a GPU to improve performance further.

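// Pseudocode example: accelerating a SUM aggregation with AVX SIMD instructions
// (intrinsics come from <immintrin.h>; aligned_input_data must be 32-byte
// aligned and round_row must be a multiple of 8)
float result = 0.0f;
for (int i = 0; i < round_row; i += 8) {
    __m256 m_tmp = _mm256_load_ps(&aligned_input_data[i]);  // load 8 floats
    __m256 ymm2 = _mm256_permute2f128_ps(m_tmp, m_tmp, 1);  // swap the two 128-bit halves
    m_tmp = _mm256_add_ps(m_tmp, ymm2);    // each half now holds low + high
    m_tmp = _mm256_hadd_ps(m_tmp, m_tmp);  // pairwise sums
    m_tmp = _mm256_hadd_ps(m_tmp, m_tmp);  // every lane holds the sum of all 8
    result += _mm_cvtss_f32(_mm256_castps256_ps128(m_tmp));  // extract lane 0
}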

For example, for aggregation, a common OLAP operation, using SIMD instructions improves compute performance by about 4x, as the pseudocode above shows. Some more complex algorithms, such as parallel filtering, aggregation, and reduction, can use the many-core compute power of GPUs to improve compute performance by up to an order of magnitude.

Cross-platform support

In the cloud computing era, a cloud platform has to support many kinds of hardware resources: besides the traditional x86 platform, it can offer domestic x86 platforms, domestic ARM platforms, and GPU and FPGA resources, and users only need to purchase the appropriate resource specification to meet the needs of different scenarios. Building on ADB PG, the Odyssey compute engine extends the range of supported platforms and currently runs well on both x86 and ARM platforms.
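
The article does not say how Odyssey selects a backend per platform. Purely as an illustration of one common C technique, the sketch below picks a backend label at compile time from the compiler's predefined architecture macros and keeps a portable kernel as the fallback. The names `TARGET_BACKEND`, `target_backend`, and `sum_column` are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pick a backend name at compile time from predefined architecture macros
 * (GCC/Clang spellings first, MSVC spellings second). */
#if defined(__x86_64__) || defined(_M_X64)
#define TARGET_BACKEND "x86_64"
#elif defined(__aarch64__) || defined(_M_ARM64)
#define TARGET_BACKEND "aarch64"
#else
#define TARGET_BACKEND "generic"
#endif

/* Report which backend this build was compiled for. */
const char *target_backend(void) {
    return TARGET_BACKEND;
}

/* Portable fallback kernel. A real engine could swap in an AVX or NEON
 * variant behind the same signature for the matching backend. */
long sum_column(const int *values, size_t n) {
    assert(values != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) total += values[i];
    return total;
}
```

Keeping one signature per kernel and varying only the implementation is what lets the rest of an engine stay identical across x86 and ARM builds.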

06.png

In addition, the Odyssey compute engine supports accelerating key operators with GPUs and FPGAs. A GPU, with its many-core architecture, offers ample compute and bandwidth resources and suits data-intensive, high-throughput scenarios. An FPGA, as a customizable processor, suits operations with very complex logic and improves the utilization of hardware resources. Currently, Odyssey supports GPU acceleration for some compute-dense operators, with performance improvements of more than 5x. It also supports FPGA-accelerated compression and decompression in the storage layer, with performance improvements of up to more than 10x.

Performance Results

The test uses the standard TPC-H benchmark. The data and the 22 SQL queries were generated with the official TPC-H tools, with 1TB of data (scale factor = 1000) and a cluster size of 32 segments. The 22 queries, Q1 through Q22, are run one at a time in order. For each query, the time it is sent is recorded as the start time and the time the result returns as the end time; the execution time is the difference between the two.

X86 platform test results

The x86 tests were run on Alibaba Cloud. The instance specifications are as follows:
Cores per node: 4
Memory per node: 32G
Disk per node: 320G SSD
Instance nodes: 32
The test results are as follows:
07.png

As the figure shows, the Odyssey compute engine has a clear advantage over the native engine. Q1, Q4, Q9, Q11, Q17, Q20, Q21, and Q22 improve by more than 1x, and Q17 improves by more than 2x. Looking at the improvement by query type, the advantage is most pronounced for compute-intensive SQL. For I/O-intensive SQL, we adapted and optimized the I/O interface, so I/O performance improves along with compute performance.

ARM platform test results

The ARM tests use three servers: one as the Master and the other two as compute nodes. Each compute node hosts 16 segments, for a total of 32 segments. Each server is configured as follows:
08.png

CPU: 128 cores
Memory: 378G
Disk: 3.84Tx4 AliFlash

As the figure shows, the Odyssey compute engine also has a clear advantage on the ARM platform, and its advantage there is even more pronounced than on the x86 platform.

Summary

These are some of the technical details of Odyssey, the new compute engine for ADB PG. In general, Odyssey adopts new technology (JIT), a new model (the batch volcano model), and new hardware features (CPU AVX instructions, GPU many-core computing, and FPGA special-purpose computing). On the public cloud, the total time for the 22 TPC-H queries at 1TB is 50% of the original engine's, and the largest single-query improvement is more than 2x. On ARM servers, ADB PG with the Odyssey compute engine doubles its performance, and the improvement is more pronounced than on the x86 platform.
Next, the Odyssey compute engine will work together with the storage layer on further optimization and push performance even further. Stay tuned!

This article is original content from the Yunqi Community and may not be reproduced without permission.


Origin blog.csdn.net/yunqiinsight/article/details/105050331