Doris Development Notes 4: Doubling the Speed of Vectorized Import, a Performance Tuning Practice

Recently, while at home, I have been summarizing some of my earlier work. It so happened that a member of the Doris community complained that the performance of vectorized import was not satisfactory, so I took the opportunity to tune the vectorized import code I had developed earlier, and got good results. This note records some of the ideas behind that performance optimization; consider it a starting point for discussion, and I hope more people will join in the performance tuning work.

1. Seemingly slow vectorized imports

Problem discovery

A complaint from a community user: vectorized import is too slow. I tested database xx and it is much faster than Doris. Is there a trick?

Huh? Is it really that much slower? Then I definitely have to take a look.
So I reproduced the user's case and found that they had run a Stream Load of the ClickBench dataset, about 80 GB of data: the vectorized import took nearly 1200s, while the non-vectorized import took about 1400s.

Vectorized      Non-vectorized
1230s           1450s

ClickBench is a typical wide-table scenario and uses the Duplicate Key model, so in principle it should give full play to the advantages of vectorized import. Something must be wrong, and we need to locate the hotspots methodically:

Tips for Locating Hot Spots

I usually rely on several methods to locate hotspots in the Doris code. Combining them helps us quickly find the real bottlenecks:

  • Profile: Doris records its own timing information, and the Profile can be used to roughly identify which part of the code is the bottleneck. The disadvantage is that it is not flexible enough; in many cases we have to add instrumentation code by hand and recompile in order to observe the hotspots we care about.

  • FlameGraph: Once the approximate hotspots have been identified through the Profile, I usually skim the relevant code and then use a flame graph to pin down the hot functions, so that the optimization is well targeted. For how to use flame graphs, you can refer to the developer manual in Doris's official documentation.

  • Perf: A flame graph can only locate hotspots at the granularity of aggregated function samples, and after compiler inlining and optimization, function-level information alone may not be enough; it is often necessary to analyze the assembly code. At this point, perf, as mentioned in Development Note 2, can be used to locate hotspots at the assembly level. Of course, perf is not a panacea; in many cases we need to tune further based on familiarity with the code itself and some optimization experience.

Next, we will analyze this problem based on the above tuning ideas.

2. Optimization and code analysis

Based on the flame graph, I sorted out several core hotspots of vectorized import and analyzed and solved them one by one:

Slow Cast and String Handling

When a CSV file is imported into Doris, the data has to go through text parsing and expression CAST computation. The flame graph clearly shows that this work is a big CPU consumer.

From the flame graph above, there is an abnormally time-consuming function call, FunctionCast::prepare_remove_prepare, which needs to be analyzed further against the source code.

During a cast, null values need to be split out first. For example, a String-to-Int cast goes through this step, as shown in the figure below:

Here, a new temporary block is built from the original block and the column to be cast, and the cast function is then evaluated on that temporary block.


The code marked in red above does a large amount of CPU work on a std::set, which hurts the performance of vectorized import. When the imported table is itself a wide table, the severity of this problem is amplified further.

After locating the problem, the optimization is very simple. Obviously, when performing a cast we only need the columns that actually participate in the cast calculation; the other columns of the block do not need to be involved at all. So I implemented a new function, create_block_with_nested_columns_only_args, to replace create_block_with_nested_columns_impl. Work that used to touch more than 100 columns was reduced to processing a single column, and the problem improved significantly.
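To make the idea concrete, here is a minimal sketch of the before/after difference. The types and function names are placeholders invented for illustration, not the actual Doris implementation:

```cpp
#include <string>
#include <vector>

// Placeholder stand-ins for Doris's Block / ColumnWithTypeAndName.
struct SimpleColumn {
    std::string name;
    // column data omitted
};

struct SimpleBlock {
    std::vector<SimpleColumn> columns;
};

// Old approach (schematic): walk every column of the block and do per-column
// bookkeeping (the original code also maintained a std::set of names), even
// though the cast only reads one or two of them. On a 100+ column wide table
// this bookkeeping dominates.
SimpleBlock make_temp_block_all_columns(const SimpleBlock& block) {
    SimpleBlock tmp;
    for (const auto& col : block.columns) {
        tmp.columns.push_back(col);  // O(total columns) per cast
    }
    return tmp;
}

// New approach (schematic): copy only the columns the cast actually uses.
SimpleBlock make_temp_block_args_only(const SimpleBlock& block,
                                      const std::vector<size_t>& arg_indices) {
    SimpleBlock tmp;
    for (size_t idx : arg_indices) {
        tmp.columns.push_back(block.columns[idx]);  // O(arguments) per cast
    }
    return tmp;
}

int main() {
    SimpleBlock block;
    for (int i = 0; i < 105; ++i) {
        block.columns.push_back({"col_" + std::to_string(i)});
    }
    // The cast only needs column 3, so the temporary block shrinks from
    // 105 columns to 1.
    SimpleBlock tmp = make_temp_block_args_only(block, {3});
    return tmp.columns.size() == 1 ? 0 : 1;
}
```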

Before optimization    After optimization
1230s                  980s

Page fault optimization

After solving the problem above, I continued to analyze the flame graph and found that when data is written into the memtable, a new hotspot appears: page fault interrupts.

Here is a brief introduction to what a page fault is:

As shown in the figure above: when the CPU processes data, it requests that data from memory. The address seen at the CPU level is a virtual address, which has to be translated into a physical address by a dedicated piece of CPU hardware, the MMU. The MMU first looks up the TLB (Translation Lookaside Buffer; remember that this is a cache) for the mapping from the virtual address to the physical address. Because the operating system manages memory in pages and an address is expressed as a page base address plus an offset, this lookup boils down to finding the start address of the page. If the page in the target virtual address space has no corresponding mapping in physical memory, a page fault occurs.

Page faults obviously impose some additional overhead:

  • Switching from user mode to kernel mode
  • The kernel handles page faults

Therefore, frequent page faults have a negative impact on import performance, and we need to try to eliminate them.
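To see this cost directly, minor page faults can be counted with the standard getrusage interface on Linux. The small stand-alone program below is only an illustration, unrelated to the Doris code: the first pass over a freshly allocated buffer faults in every page, while a second pass over the same, already-mapped buffer barely faults at all.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/resource.h>

// Return the number of minor page faults this process has taken so far.
static long minor_faults() {
    struct rusage usage;
    getrusage(RUSAGE_SELF, &usage);
    return usage.ru_minflt;
}

int main() {
    constexpr size_t kSize = 256UL * 1024 * 1024;  // 256 MB
    // A large malloc is typically served by mmap: the virtual range exists,
    // but no physical pages are mapped until the memory is first touched.
    char* buf = static_cast<char*>(std::malloc(kSize));
    if (buf == nullptr) return 1;

    long before = minor_faults();
    std::memset(buf, 1, kSize);   // first touch: a page fault for every page
    long after_first = minor_faults();
    std::memset(buf, 2, kSize);   // second touch: pages are already mapped
    long after_second = minor_faults();

    std::printf("first pass faults:  %ld\n", after_first - before);
    std::printf("second pass faults: %ld\n", after_second - after_first);
    std::free(buf);
    return 0;
}
```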

Memory reuse

A lot of the memory allocation and addressing here comes from operations on Column, so we have to attack the problem at its source: memory allocation.

The solution is also very simple. Since page faults are caused by missing memory mappings, we can reuse memory that has already been used; that way page faults naturally do not occur, and TLB accesses also get better locality.

Doris internally provides a ChunkAllocator class that implements memory allocation, reuse, and core-binding logic. ChunkAllocator can greatly improve the efficiency of memory allocation, and in this case it also avoids the page fault interrupts:
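The reuse idea can be sketched roughly as follows. This is a deliberately simplified free-list cache for illustration only, not the actual ChunkAllocator implementation (the real one also includes the core-binding logic mentioned above):

```cpp
#include <cstdlib>
#include <mutex>
#include <vector>

// A minimal sketch of the reuse idea behind a chunk allocator: keep freed
// chunks in a cache and hand them back out instead of asking the OS again,
// so the pages stay mapped and later writes do not fault.
class SimpleChunkCache {
public:
    explicit SimpleChunkCache(size_t chunk_size) : chunk_size_(chunk_size) {}

    ~SimpleChunkCache() {
        for (void* p : free_chunks_) std::free(p);
    }

    void* allocate() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!free_chunks_.empty()) {
            void* p = free_chunks_.back();   // reuse: pages already mapped
            free_chunks_.pop_back();
            return p;
        }
        return std::malloc(chunk_size_);     // cold path: fresh, unmapped pages
    }

    void release(void* p) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_chunks_.push_back(p);           // keep it for the next allocation
    }

private:
    size_t chunk_size_;
    std::mutex mutex_;
    std::vector<void*> free_chunks_;
};
```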

After replacing the memory allocation logic of PODArray, the effect matched expectations: observed through the flame graph, the proportion of page fault interrupts dropped sharply, and a considerable performance gain was obtained.

Before optimization    After optimization
980s                   776s

3. Some related optimization TODOs:

  • CSV parsing: prefetch multiple rows of data through a 4 KB cache, and use the SIMD instruction set to further optimize performance (see the sketch after this list)

  • Page fault optimization: for the page faults that still occur during some memory allocation and copying, consider introducing huge pages to further reduce page fault interrupts and improve the efficiency of the page-table cache
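As a rough illustration of the SIMD idea in the first TODO above, the sketch below locates a column delimiter 16 bytes at a time with SSE2. It is purely an illustration under the assumption of an SSE2-capable x86 CPU, not Doris's CSV reader:

```cpp
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// Find the first occurrence of `delim` in data[0, len), scanning 16 bytes per
// iteration. Returns `len` if the delimiter is not found.
size_t find_delimiter_simd(const char* data, size_t len, char delim) {
    const __m128i needle = _mm_set1_epi8(delim);
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask != 0) {
            return i + __builtin_ctz(mask);  // index of the first matching byte
        }
    }
    for (; i < len; ++i) {                   // scalar tail for the last <16 bytes
        if (data[i] == delim) return i;
    }
    return len;
}
```

A row parser could call such a routine repeatedly to jump from one column separator to the next instead of checking bytes one at a time.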

4. Summary

Of course, the vectorized import work described here is only one part of Doris's overall vectorized import effort. Many community members have been deeply involved in related work and have achieved even better performance on this basis. In short, performance optimization is never-ending work.

Special thanks also go to two community members for their code review and analysis help: xinyiZzz and Gabriel.

Please look forward to the upcoming Doris 1.2 release, which will be fully vectorized; I believe it will surprise you in terms of both performance and stability.

Finally, I also hope everyone will support Apache Doris and contribute code to it. Thank you~

5. References

Page Fault
Apache Doris Source Code
