How Apache Doris 2.0 Achieved 2-8x Import Performance Improvement

Data import throughput is an important measure of OLAP system performance: efficient import directly speeds up real-time data processing and analysis. As the Apache Doris user base keeps growing, more and more users are placing higher demands on data import, which in turn poses greater challenges to Doris's import capabilities.

To support fast data writes, the Apache Doris storage engine adopts a structure similar to an LSM Tree. During an import, data is first written into the MemTable of the corresponding Tablet; the MemTable is organized as a SkipList. When the MemTable is full, its contents are flushed to disk. The flush happens in two stages: first, the row-oriented structure in the MemTable is converted into a column-oriented structure in memory, and a corresponding index structure is generated for each column; second, the converted column-oriented data is written to disk as a Segment file.
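To make this write path concrete, here is a minimal, self-contained C++ sketch of the flow described above. It is an illustration only: an std::multimap stands in for the SkipList, and the class, threshold, and file names are made up rather than taken from the actual Doris code.

```cpp
// Minimal sketch of the write path: rows accumulate in a MemTable and are
// flushed to a Segment file in two stages (columnize, then write to disk).
// std::multimap stands in for the SkipList; names are illustrative only.
#include <cstdint>
#include <fstream>
#include <map>
#include <string>
#include <vector>

struct Row { int64_t key; std::string value; };

class MemTableSketch {
public:
    void insert(const Row& row) {
        rows_.emplace(row.key, row.value);           // rows kept ordered by key
        if (rows_.size() >= kFlushThreshold) flush();
    }

    void flush() {
        if (rows_.empty()) return;
        // Stage 1: convert the row-oriented structure into in-memory columns.
        std::vector<int64_t> key_col;
        std::vector<std::string> value_col;
        for (const auto& [k, v] : rows_) { key_col.push_back(k); value_col.push_back(v); }

        // Stage 2: persist the columns as a "Segment" file (format simplified).
        std::ofstream seg("segment_" + std::to_string(segment_id_++) + ".dat");
        for (size_t i = 0; i < key_col.size(); ++i)
            seg << key_col[i] << ',' << value_col[i] << '\n';

        rows_.clear();                               // release MemTable memory
    }

private:
    static constexpr size_t kFlushThreshold = 4096;
    std::multimap<int64_t, std::string> rows_;
    int segment_id_ = 0;
};

int main() {
    MemTableSketch mt;
    for (int64_t i = 0; i < 10000; ++i) mt.insert({i % 100, "v" + std::to_string(i)});
    mt.flush();                                      // flush whatever is left
}
```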

Specifically, during an import Apache Doris divides the BE nodes into upstream and downstream roles. The upstream BE processes data in two steps, Scan and Sink: the Scan step parses the raw data, and the Sink step organizes the data and distributes it to the downstream BEs via RPC. After a downstream BE receives the data, it first batches it in the in-memory MemTable structure, sorts and aggregates it, and finally flushes it to the hard disk as a data file (also called a Segment file) for persistent storage.

performance1.png

However, in practice the import process may run into the following problems:

  • The RPC between the upstream and downstream BEs uses a Ping-Pong mode: the upstream BE sends the next request only after the downstream BE has finished processing the previous request and replied. If the downstream BE spends a long time on MemTable processing, the upstream BE has to wait that much longer for the RPC to return, which hurts data transmission efficiency (a minimal illustration of this mode follows the list).
  • When importing into a multi-replica table, the MemTable processing has to be repeated for every replica. This consumes memory and CPU resources on each replica's node, and the redundant processing also hurts overall execution efficiency.
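To illustrate the first problem, the following sketch (with made-up function names, not the actual Doris brpc interfaces) shows how a Ping-Pong style exchange serializes downstream MemTable processing time into the upstream sender's wait:

```cpp
// Sketch of the Ping-Pong RPC mode (illustrative names, not the actual Doris
// brpc interfaces): the upstream BE cannot send batch N+1 until the downstream
// BE has fully processed batch N and replied, so downstream MemTable time is
// added directly to the sender's wait.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Stand-in for the downstream BE; "processing" includes the MemTable work.
bool handle_batch_downstream(const std::vector<int>& batch) {
    (void)batch;
    std::this_thread::sleep_for(std::chrono::milliseconds(50));  // insert/sort/aggregate
    return true;                                                 // reply is sent only now
}

int main() {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 10; ++i) {
        std::vector<int> batch(1024, i);
        // The upstream BE blocks here until the downstream replies (Ping-Pong).
        handle_batch_downstream(batch);
    }
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("10 batches took %lld ms: downstream processing time is fully serialized\n",
                static_cast<long long>(ms));
}
```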

To solve these problems, the newly released Apache Doris 2.0 (https://github.com/apache/doris/tree/2.0.1-rc04) optimizes the batching, sorting, and flushing of the MemTable during import, improving the efficiency of data transmission between the upstream and downstream BEs. In addition, the new version provides a "single-replica import" data distribution mode: when importing into multi-replica tables, the MemTable work no longer has to be repeated on multiple BEs, which makes better use of the cluster's compute and memory resources and improves overall import throughput.

MemTable Optimization


01 Write Optimization

In previous versions of Apache Doris, when a downstream BE wrote to the MemTable, it updated the SkipList in real time to keep the keys ordered. For Unique Key or Aggregate Key tables, whenever an existing key was encountered, the aggregation function was called to merge the rows. Both steps can take considerable processing time, delaying the RPC response and reducing write efficiency.

performance2.png

So we optimized this process in version 2.0. When a downstream BE writes to the MemTable, it no longer keeps the keys ordered in real time; instead, ordering is deferred until the MemTable is about to be flushed into a Segment. In addition, we replaced std::sort with the more efficient pdqsort and implemented a cache-friendly column-first sorting method, achieving better sorting performance. Together, these changes ensure the RPC can be answered promptly.
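As a rough illustration of deferred, column-first sorting at flush time, the sketch below builds a permutation of row indices ordered by the key column and then rearranges each column with it. std::sort stands in for pdqsort here, and the types are illustrative rather than the actual Doris classes:

```cpp
// Sketch of deferred, column-first sorting at flush time (illustrative types;
// std::sort stands in for pdqsort). Instead of ordering rows on every insert,
// we sort once at flush: build a permutation of row indices ordered by the key
// column, then rearrange each column in one contiguous, cache-friendly pass.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <string>
#include <vector>

struct ColumnBlock {
    std::vector<int64_t> keys;        // key column, in arrival order
    std::vector<std::string> values;  // value column, in arrival order
};

ColumnBlock sort_at_flush(const ColumnBlock& in) {
    // 1. Build the permutation: row indices sorted by the key column.
    std::vector<size_t> perm(in.keys.size());
    std::iota(perm.begin(), perm.end(), 0);
    std::sort(perm.begin(), perm.end(),                    // pdqsort in Doris 2.0
              [&](size_t a, size_t b) { return in.keys[a] < in.keys[b]; });

    // 2. Apply the permutation column by column.
    ColumnBlock out;
    out.keys.reserve(perm.size());
    out.values.reserve(perm.size());
    for (size_t i : perm) {
        out.keys.push_back(in.keys[i]);
        out.values.push_back(in.values[i]);
    }
    return out;
}

int main() {
    ColumnBlock block{{3, 1, 2}, {"c", "a", "b"}};
    ColumnBlock sorted = sort_at_flush(block);             // keys 1,2,3; values a,b,c
    std::printf("first key after sort: %lld\n", static_cast<long long>(sorted.keys.front()));
}
```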

02 Parallel Flush

During an import, once the MemTable on a downstream BE reaches a certain size, it is flushed to disk as a Segment data file to persist the data and free memory. To keep the Ping-Pong RPC performance mentioned above from being affected, the MemTable flush is submitted to a thread pool and executed asynchronously.

In previous versions of Apache Doris, MemTable flush tasks for Unique Key tables were executed serially: because different Segment files may contain duplicate keys, serial execution preserved their ordering, and the Segment number was assigned only when the next flush task was scheduled for execution. At the same time, when there were too few tablets to provide enough concurrency, serial flushing could leave the system's IO resources underutilized. In Apache Doris 2.0, since the sorting and aggregation of keys are now deferred, the flush task carries CPU load (the deferred sorting and aggregation) in addition to its original IO load. If flushing remained serial, then whenever there were not enough tablets to provide concurrency, CPU and IO would alternately become the bottleneck, and flush throughput would drop significantly.

To solve this problem, we assign the Segment number to a flush task at the time it is submitted, which keeps the sequence of Segment files correct even when flushes run in parallel. We also adjusted the subsequent Rowset construction so that it can handle discontinuous Segment numbers. With these improvements, MemTables of all table types can be flushed in parallel, improving overall resource utilization and import throughput.
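The sketch below illustrates the idea of assigning Segment numbers at submission time so that flush tasks can run concurrently. std::async stands in for the BE's flush thread pool, and all names are illustrative rather than the actual Doris code:

```cpp
// Sketch of parallel flush with Segment numbers assigned at submission time
// (illustrative names; std::async stands in for the BE's flush thread pool).
// Each task captures its Segment id before it runs, so flushes may execute
// concurrently in any order while the files still carry the correct sequence.
#include <atomic>
#include <cstdio>
#include <future>
#include <utility>
#include <vector>

struct MemTableData { std::vector<int> rows; };

std::atomic<int> g_next_segment_id{0};

void flush_memtable(MemTableData data, int segment_id) {
    // The deferred work happens here: sort + aggregate, then write segment_<id>.
    std::printf("flushed segment %d with %zu rows\n", segment_id, data.rows.size());
}

int main() {
    std::vector<std::future<void>> inflight;
    for (int i = 0; i < 4; ++i) {
        MemTableData mt{std::vector<int>(1000, i)};
        int seg_id = g_next_segment_id.fetch_add(1);       // id fixed at submission
        inflight.push_back(std::async(std::launch::async, flush_memtable,
                                      std::move(mt), seg_id));
    }
    for (auto& f : inflight) f.get();                      // wait for all flushes
}
```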

03 Optimization Effect

With these MemTable optimizations, Stream Load throughput improves to varying degrees across different import scenarios (detailed comparison data is given below). The optimization applies not only to Stream Load but also to the other import methods supported by Apache Doris, such as Insert Into, Broker Load, and S3 Load, all of which gain import efficiency and performance to varying degrees.

Single-Replica Import


01 Principle and Implementation

In previous versions, when writing to a multi-replica table, each data replica in Apache Doris had to be sorted and compressed on its own node, which consumed a lot of resources. To save CPU and memory, Apache Doris 2.0 provides single-replica import: one replica is selected as the master replica (the others are slave replicas), and the computation is performed only on the master replica. Once all of the master replica's data files have been written successfully, the nodes holding the slave replicas are notified to pull the master's data files directly, synchronizing the data between replicas. The import returns after all slave nodes have finished pulling or after a timeout (success is returned as long as a majority of replicas succeed). Because the data no longer needs to be processed node by node, the load on the nodes is reduced, and the saved compute and memory can be used for other tasks, improving the concurrent throughput of the whole system.
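A conceptual sketch of this flow, with simplified coordination and made-up function names rather than the actual Doris implementation, might look like this:

```cpp
// Conceptual sketch of single-replica import: only the master replica does the
// MemTable/sort/flush work; slave replicas just pull the finished files. All
// names and the coordination are simplified, not the actual Doris code.
#include <chrono>
#include <cstdio>
#include <future>
#include <vector>

bool write_master_segments() { return true; }              // full compute, master only

bool pull_segments_from_master(int slave_id) {             // slave pulls data files
    (void)slave_id;
    return true;
}

int main() {
    const int num_replicas = 3;                             // 1 master + 2 slaves
    if (!write_master_segments()) return 1;

    // Notify the slave nodes to pull the master's data files, in parallel.
    std::vector<std::future<bool>> pulls;
    for (int s = 1; s < num_replicas; ++s)
        pulls.push_back(std::async(std::launch::async, pull_segments_from_master, s));

    // Wait for the pulls with a timeout; the import is reported successful as
    // long as a majority of replicas (master included) hold the data.
    int ok = 1;                                             // the master already succeeded
    for (auto& p : pulls)
        if (p.wait_for(std::chrono::seconds(30)) == std::future_status::ready && p.get())
            ++ok;
    std::puts(ok > num_replicas / 2 ? "import succeeded" : "import failed");
}
```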

performance3.jpeg

02 How to Enable

FE configuration:

enable_single_replica_load = true

BE configuration:

enable_single_replica_load = true

Session variable (for INSERT INTO):

SET experimental_enable_single_replica_insert = true;

03 Optimization Effect

  • For single-concurrency imports, single-replica import effectively reduces resource consumption. It uses only about 1/3 of the memory of three-replica import (only one replica's worth of MemTable memory is needed instead of three). Actual tests also show that its CPU consumption is roughly 1/2 that of three-replica import, which saves CPU resources significantly.
  • For multi-concurrency imports, single-replica import significantly increases task throughput at the same resource cost. In actual tests, the same set of concurrent import tasks took 67 minutes with three-replica import but only 27 minutes with single-replica import, an improvement in import efficiency of about 2.5x. Detailed data is given below.

Performance Comparison


Test environment and configuration:

  • 3 BEs (16C 64G), each configured with 3 disks (about 150 MB/s single-disk read/write)
  • 1 FE, sharing a machine with one of the BEs

The raw data is the Lineitem table generated by TPC-H SF100, stored on a dedicated disk (about 150 MB/s read) on the machine where the FE is located.

01 Stream Load (Single Concurrency)

performance4.png

In the single-concurrency scenarios listed above, the overall import performance of Apache Doris 2.0 is 2-7x that of version 1.2.6; with multiple replicas and the new single-replica import feature enabled, import performance improves by 2-8x.

02 INSERT INTO (Multi-Concurrency)

performance5.png

In the multi-concurrency scenarios listed above, Apache Doris 2.0 shows a modest overall improvement over version 1.2.6; with the new single-replica import feature enabled, multi-replica import performance improves markedly, and the import speed is about 50% higher than version 1.2.6.

Conclusion


The community has always been committed to improving import performance, a core capability of Apache Doris, to give users a better and more efficient analysis experience. Through the MemTable optimizations and single-replica import introduced in version 2.0, import performance is now several times higher than in previous versions. Going forward, we will continue iterating in version 2.1: building on the MemTable optimizations and the resource-efficiency ideas behind single-replica import, together with an optimized IO model and a streamlined IO path based on streaming RPC, we will further improve import performance and reduce the impact of imports on query performance, providing users with an even better data import experience.

About the Authors

Kaijie Chen, Senior R&D Engineer of SelectDB

Zhengyu Zhang, Senior R&D Engineer of SelectDB
