Paimon multi-way merge optimization based on LoserTree

Abstract: In multi-way merge sorting, the number of comparisons has a large impact on the overall sorting time. This article describes the design of replacing the heap sort algorithm with LoserTree in the multi-way merge implementation of Paimon's SortMergeReader in order to reduce the number of comparisons, and reports the resulting performance gains. It covers the following aspects:
  1. Background: the principle of reading data in Paimon and the idea behind the optimization;
  2. Multi-way merge algorithm: the implementation principles of heap sort and LoserTree, with an analysis and comparison of their complexity;
  3. Scheme design: the problems with using LoserTree in Paimon, and an optimized implementation based on LoserTree;
  4. Algorithm proof: a correctness analysis and proof of the new algorithm;
  5. Performance benefits: the performance gains measured in benchmark tests after the implementation was completed.

1. Background

In Paimon's SortMergeReader, the Key-Values returned by multiple RecordReaders are read, and records with the same Key are merged using a MergeFunction. The data of each RecordReader is ordered, so the entire reading process is in fact a multi-way merge of data from multiple RecordReaders. During the merge, the more comparisons are performed, the longer the overall sort takes.
The main algorithms for multi-way merge are heap sort, the winner tree, and the loser tree. Heap sort must compare against both the left and right child nodes at every step of a heap adjustment, so an adjustment costs about 2logN comparisons (N being the number of sequences being merged), whereas adjusting a winner tree or a loser tree costs about logN comparisons. The winner tree and the loser tree differ in that the winner tree must compare with sibling nodes and update the parent node, while the loser tree only compares with the parent node and therefore needs fewer memory accesses. Paimon currently implements SortMergeReader with heap sort by default, so we consider using LoserTree instead to reduce the number of comparisons when reading large amounts of data, thereby improving performance.

2. Introduction of multi-way merge algorithm

The multi-way merge algorithm is mainly used for external sorting and follows a sort-merge strategy. When the amount of data to be processed is too large to fit in memory, the data is organized into multiple ordered sub-files, which are then merged. In Paimon, the data of each RecordReader is already ordered, so only the merge step is needed. The following sections introduce heap sort and the LoserTree algorithm, and compare the performance of the two.

2.1 Heap sort

Heap sort is an algorithm that uses a heap as its sorting data structure. A heap is a complete binary tree. Depending on whether the value stored in a parent node is greater than or less than the values of its child nodes, it is either a max-heap or a min-heap. Taking the min-heap as an example, the sorting process consists of two phases: heap building and heap adjustment. Throughout the sort, whenever a parent-child comparison leads to an exchange, a top-down adjustment follows, and each step of that adjustment must compare against both child nodes.
1. Build a heap
Assume there are 5 sequences to be sorted. The first step is to arrange the 5 sequences into a min-heap according to their head elements; the adjustment proceeds bottom-up.
1) First adjust the Node4 node;
2) Then adjust the Node3 node;
3) When adjusting the Node2 node, since it is larger than the parent node Node0, no adjustment is required;
4) Continue with Node1. Since Node1 is smaller than Node0, it is first exchanged with Node0 and then adjusted further downward. At this point the min-heap has been built.
2. Heap adjustment
At each step of the sort, the current smallest element is taken from the head node, the next element of the corresponding sequence is placed in the head node, and the heap is then adjusted from top to bottom. Each downward step must compare against both the left and right child nodes to select the minimum.
3. Complexity analysis
Assuming that the number of sequences to be sorted is N and the total number of elements to be sorted is n, then:
1) The space complexity is O(N);
2) The time complexity of the overall sort is O(nlogN);
3) The time complexity of a single adjustment is O(logN). Since each step must compare against both child nodes, a single adjustment takes about 2logN comparisons.
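The heap building and heap adjustment steps above can be sketched in Java as follows. This is a minimal illustration rather than Paimon's actual code: the class name HeapMergeSketch, the use of plain int arrays as the sorted sequences, and the comparisons counter are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a heap-based k-way merge over sorted int arrays. */
final class HeapMergeSketch {
    /** Number of element comparisons made so far (for illustration only). */
    static long comparisons = 0;

    static List<Integer> merge(int[][] runs) {
        int[] pos = new int[runs.length];   // read position inside each run
        int[] heap = new int[runs.length];  // run indices, ordered by head element
        int size = 0;
        for (int r = 0; r < runs.length; r++) {
            if (runs[r].length > 0) {
                heap[size++] = r;
            }
        }
        // Heap building: sift down every parent, from the last one up to the root.
        for (int i = size / 2 - 1; i >= 0; i--) {
            siftDown(runs, pos, heap, size, i);
        }
        List<Integer> out = new ArrayList<>();
        while (size > 0) {
            int r = heap[0];
            out.add(runs[r][pos[r]++]);     // take the smallest head element
            if (pos[r] == runs[r].length) {
                heap[0] = heap[--size];     // run exhausted: move the last slot to the root
            }
            siftDown(runs, pos, heap, size, 0);
        }
        return out;
    }

    /** Top-down adjustment: up to two comparisons (left and right child) per level. */
    private static void siftDown(int[][] runs, int[] pos, int[] heap, int size, int i) {
        while (true) {
            int left = 2 * i + 1;
            int right = left + 1;
            int smallest = i;
            if (left < size && less(runs, pos, heap[left], heap[smallest])) {
                smallest = left;
            }
            if (right < size && less(runs, pos, heap[right], heap[smallest])) {
                smallest = right;
            }
            if (smallest == i) {
                return;
            }
            int tmp = heap[i];
            heap[i] = heap[smallest];
            heap[smallest] = tmp;
            i = smallest;
        }
    }

    private static boolean less(int[][] runs, int[] pos, int a, int b) {
        comparisons++;
        return runs[a][pos[a]] < runs[b][pos[b]];
    }
}
```

The two less calls inside each loop iteration of siftDown are where the factor of 2 in the 2logN estimate comes from.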

2.2 LoserTree

LoserTree is another data structure commonly used in merge sort algorithms, and it is also a complete binary tree. In this tree, the leaf nodes represent the sequences to be sorted, and each non-leaf node records the loser of the comparison between its two children; Node0 holds the global Winner. Compared with heap sort, LoserTree simplifies the tree adjustment. Each intermediate node records the loser of the most recent comparison at that node, and that loser is itself the winner of the child subtree it came from. As a result, each readjustment only needs to compare with the parent node on the way from bottom to top to obtain the new global Winner. Similar to heap sort, the sorting process of LoserTree consists of two phases: tree initialization and tree adjustment.
1. Tree initialization
LoserTree is also initialized from the bottom up and from back to front: the loser stays in the intermediate node, and the winner continues to be compared upward.
1) Adjust the leaf node Leaf4. Since its parent node does not yet record a loser, the parent is set to Leaf4;
2) Adjust the leaf node Leaf3 and compare it with the loser Leaf4 recorded in the parent node. Leaf3 wins and continues upward. Since Node2 does not yet record a loser, it is set to Leaf3;
3) Adjust the leaf node Leaf2. As with Leaf4, the loser of its parent node is set to Leaf2;
4) Adjust the leaf node Leaf1 and compare it with the loser Leaf2 recorded in the parent node. Leaf1 wins, continues upward, and the loser of Node1 is set to Leaf1;
5) Finally, adjust the leaf node Leaf0 and compare it with the loser Leaf3 recorded in the parent node. Leaf3 wins, so the loser of that node is set to Leaf0. Leaf3 then continues upward and is compared with the loser Leaf1 in Node1; Leaf3 wins again, and the global Winner in Node0 is updated to Leaf3. This completes the initialization of the LoserTree.
2. Tree adjustment
Similar to heap sort, an element is taken from the head node, the corresponding leaf node advances to the next element of its sequence, and comparisons then proceed from bottom to top until the head node is reached, yielding a new global Winner.
3. Complexity analysis
Assuming that the number of sequences to be sorted is N and the total number of elements to be sorted is n, then:
1) The space complexity is O(N);
2) The time complexity of the overall sort is O(nlogN);
3) The time complexity of a single adjustment is O(logN). Each step only needs to compare with the parent node, so a single tree adjustment takes about logN comparisons.
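A conventional loser tree following the initialization and adjustment steps above might look like the sketch below. It is an illustration under stated assumptions, not Paimon's implementation: the class name LoserTreeMergeSketch, the array layout (tree[0] for the global Winner, tree[1..N-1] for the losers, parent of leaf i at index (i + N) / 2), the -1 marker for an empty node, and the treatment of an exhausted sequence as "always loses" are choices made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a conventional loser tree merging sorted int arrays. */
final class LoserTreeMergeSketch {
    private final int[][] runs;
    private final int[] pos;   // read position inside each run
    private final int[] tree;  // tree[0] = global winner leaf, tree[1..n-1] = loser leaves
    private final int n;
    long comparisons = 0;      // element comparisons, for illustration only

    private LoserTreeMergeSketch(int[][] runs) {
        this.runs = runs;
        this.n = runs.length;
        this.pos = new int[n];
        this.tree = new int[n];
        Arrays.fill(tree, -1);                 // -1 marks a node without a loser yet
        for (int leaf = n - 1; leaf >= 0; leaf--) {
            insert(leaf);                      // initialize from back to front
        }
    }

    /** Initialization: park in the first empty node, otherwise the loser stays behind. */
    private void insert(int leaf) {
        int winner = leaf;
        for (int node = (leaf + n) / 2; node > 0; node /= 2) {
            if (tree[node] == -1) {
                tree[node] = winner;
                return;
            }
            winner = playAt(node, winner);
        }
        tree[0] = winner;
    }

    /** Adjustment: one comparison with the parent per level, bottom-up. */
    private void replay(int leaf) {
        int winner = leaf;
        for (int node = (leaf + n) / 2; node > 0; node /= 2) {
            winner = playAt(node, winner);
        }
        tree[0] = winner;
    }

    /** Compares the challenger with the loser stored at node and returns the winner. */
    private int playAt(int node, int challenger) {
        if (less(tree[node], challenger)) {
            int stored = tree[node];
            tree[node] = challenger;           // challenger lost and stays in the node
            return stored;
        }
        return challenger;
    }

    private boolean exhausted(int leaf) {
        return pos[leaf] == runs[leaf].length;
    }

    private boolean less(int a, int b) {
        if (exhausted(a)) return false;        // an exhausted run never wins
        if (exhausted(b)) return true;
        comparisons++;
        return runs[a][pos[a]] < runs[b][pos[b]];
    }

    static List<Integer> merge(int[][] runs) {
        List<Integer> out = new ArrayList<>();
        if (runs.length == 0) {
            return out;
        }
        LoserTreeMergeSketch t = new LoserTreeMergeSketch(runs);
        while (!t.exhausted(t.tree[0])) {
            int w = t.tree[0];
            out.add(runs[w][t.pos[w]++]);      // take the global winner, advance its run
            t.replay(w);
        }
        return out;
    }
}
```

For example, LoserTreeMergeSketch.merge(new int[][] {{1, 4}, {2, 3}}) returns [1, 2, 3, 4]. Note that replay performs a single comparison per level, against the loser stored in the parent.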

2.3 Algorithm comparison

According to the complexity analysis above, the two algorithms have the same space and time complexity; they differ in the number of comparisons. The tree adjustment of LoserTree is simpler, and in theory LoserTree halves the number of comparisons relative to heap sort. When element comparison is expensive, the benefit of fewer comparisons is significant. We therefore chose LoserTree as the basic sorting data structure in the optimization scheme that follows.
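As a rough worked example of the "half the comparisons" estimate, take the hypothetical values N = 16 sequences and n = 10^6 elements (both chosen only for illustration) and apply the per-adjustment counts 2logN and logN from the analysis above. These are upper-bound estimates, not measurements:

```latex
\begin{aligned}
\text{heap sort, per adjustment:} \quad & 2\log_2 N = 2 \times 4 = 8 \text{ comparisons} \\
\text{LoserTree, per adjustment:} \quad & \log_2 N = 4 \text{ comparisons} \\
\text{heap sort, whole merge:} \quad & n \cdot 2\log_2 N = 10^6 \times 8 = 8 \times 10^6 \text{ comparisons} \\
\text{LoserTree, whole merge:} \quad & n \cdot \log_2 N = 10^6 \times 4 = 4 \times 10^6 \text{ comparisons}
\end{aligned}
```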

3. LoserTree optimization scheme

In a conventional LoserTree implementation, it is enough to initialize the tree, repeatedly take the global Winner from the top of the tree, and then adjust the tree from bottom to top. In Paimon, however, SortMergeReader must finish merging all records with the same UserKey before returning. A RecordReader reuses the same Java object to return data, and a previously returned object may still be cached in the MergeFunction, so we cannot advance the RecordReader to its next record during tree adjustment: doing so would overwrite the previously returned object. Methods such as deep copy could solve this problem, but the copying overhead is so large that it can even make performance worse.
Therefore, a variant implementation of LoserTree is needed: the RecordReaders are advanced only after each round of merging records with the same UserKey is complete.
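The object reuse problem described above can be reproduced with a small, self-contained sketch. The KeyValue and ReusingReader classes here are hypothetical stand-ins, not Paimon's actual classes; they only imitate a reader that hands back the same mutable instance on every call.

```java
/** Hypothetical sketch of why a reader that reuses its return object cannot be advanced early. */
final class ObjectReuseSketch {
    /** Stand-in for a mutable Key-Value record. */
    static final class KeyValue {
        int userKey;
        String value;
    }

    /** Stand-in for a reader that returns the same KeyValue instance on every call. */
    static final class ReusingReader {
        private final int[] keys;
        private final String[] values;
        private final KeyValue reused = new KeyValue();
        private int next = 0;

        ReusingReader(int[] keys, String[] values) {
            this.keys = keys;
            this.values = values;
        }

        /** Returns the next record, or null when the reader is exhausted. */
        KeyValue next() {
            if (next == keys.length) {
                return null;
            }
            reused.userKey = keys[next];
            reused.value = values[next];
            next++;
            return reused;   // same instance every time
        }
    }

    /** Returns the cached record's value before and after the reader is advanced. */
    static String[] observe() {
        ReusingReader reader = new ReusingReader(new int[] {1, 2}, new String[] {"a", "b"});
        KeyValue cached = reader.next();   // imagine a merge function holding this reference
        String before = cached.value;
        reader.next();                     // advancing the reader overwrites the cached object
        String after = cached.value;
        return new String[] {before, after};
    }
}
```

Calling ObjectReuseSketch.observe() yields "a" followed by "b": the record that was cached now shows the second record's value, which is why the reader is only advanced after the merge round for a UserKey has finished.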

3.1 Preconditions

  1. Each Key in Paimon consists of two parts: UserKey + SequenceNumber;
  2. The data in each RecordReader is ordered, and a single RecordReader does not contain duplicate UserKeys.
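The two-part Key in the first precondition implies a two-level ordering, sketched below. The Key class, its field types, and the ORDER comparator are hypothetical names for this example; ordering equal UserKeys by ascending SequenceNumber is an assumption, chosen to match the later description of a following node with the same UserKey having a larger SequenceNumber.

```java
import java.util.Comparator;

/** Hypothetical sketch of ordering Keys by UserKey first and SequenceNumber second. */
final class KeyOrderSketch {
    /** Stand-in for a Key made of UserKey + SequenceNumber. */
    static final class Key {
        final String userKey;
        final long sequenceNumber;

        Key(String userKey, long sequenceNumber) {
            this.userKey = userKey;
            this.sequenceNumber = sequenceNumber;
        }
    }

    /** The UserKey comparison decides first; the SequenceNumber only breaks ties. */
    static final Comparator<Key> ORDER =
            Comparator.comparing((Key k) -> k.userKey)
                    .thenComparingLong(k -> k.sequenceNumber);
}
```

When two Keys are already known to share a UserKey, only the cheap SequenceNumber comparison is needed, which is the saving the state machine in section 3.3 exploits.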

3.2 Initialization

Consistent with the initialization method of the regular LoserTree, the LoserTree is built from the bottom up, the loser becomes the intermediate node, and the winner continues to compare upwards.

3.3 Sorting

When adjusting the tree, because of object reuse, we cannot advance the RecordReader directly to its next record. Instead, we first mark the record, which is similar to setting its SequenceNumber to infinity, and then adjust from bottom to top, so that all nodes with the same UserKey are eventually visited. UserKey comparison is relatively expensive on every tree adjustment. However, nodes with the same UserKey as the node being adjusted have already been compared during earlier LoserTree adjustments, so those comparison results can be reused directly. A state machine is therefore introduced: state transitions are performed during node comparison to avoid repeated comparisons.
  • State definition
A total of 6 states are defined to represent nodes in different situations.
  1. WINNER_WITH_NEW_KEY: Has a different UserKey from the last global Winner;
  2. WINNER_WITH_SAME_KEY: Has the same UserKey as the last global Winner, but a larger SequenceNumber;
  3. WINNER_POPPED: The global Winner has been taken out and processed; this state is also used to determine whether the tree still contains unprocessed nodes with the same UserKey;
  4. LOSER_WITH_NEW_KEY: Has a different UserKey from the last Local Winner that defeated it;
  5. LOSER_WITH_SAME_KEY: Has the same UserKey as the last Local Winner that defeated it;
  6. LOSER_POPPED: Has the same UserKey as the last global Winner, and has already been taken out for processing.
  • State transition
When two nodes are compared, state transitions follow these rules:
  1. Each new Key produced by advancing a leaf node starts in the state WINNER_WITH_NEW_KEY;
  2. When the head node of the tree is taken out, the state of the corresponding leaf node switches to WINNER_POPPED; this can be regarded as keeping the UserKey unchanged while setting the SequenceNumber to infinity;
  3. Depending on the state of the Local Winner, different transitions are performed when it meets parent nodes in different states:
  • The state of the Local Winner is WINNER_WITH_NEW_KEY, and the state of the parent node is:
    • LOSER_WITH_NEW_KEY: The two nodes must be compared to determine the new Winner; if the UserKeys of the two nodes are the same, the state of the loser switches to LOSER_WITH_SAME_KEY;
    • LOSER_WITH_SAME_KEY: This case is impossible, because WINNER_WITH_NEW_KEY means that a new round of adjustment has started, so all nodes with the same UserKey as the last global Winner must already have been processed;
    • LOSER_POPPED: No comparison is needed; the parent node wins and switches to WINNER_POPPED, and the child node switches to LOSER_WITH_NEW_KEY.
  • The state of the Local Winner is WINNER_WITH_SAME_KEY, and the state of the parent node is:
    • LOSER_WITH_NEW_KEY: No comparison or state change is needed; the child node wins;
    • LOSER_WITH_SAME_KEY: The UserKeys of the two nodes are the same, so only their SequenceNumbers need to be compared, which reduces the comparison overhead. The winner switches to WINNER_WITH_SAME_KEY, and the loser switches to LOSER_WITH_SAME_KEY;
    • LOSER_POPPED: No comparison or state change is needed; the child node wins.
  • The state of the Local Winner is WINNER_POPPED, and the state of the parent node is:
    • LOSER_WITH_NEW_KEY: No comparison or state change is needed; the child node wins;
    • LOSER_WITH_SAME_KEY: No comparison is needed; the parent node wins and switches to WINNER_WITH_SAME_KEY, and the child node switches to LOSER_POPPED;
    • LOSER_POPPED: No comparison or state change is needed; the child node wins.
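The transition rules above can be written down as a small decision table. The sketch below is only an illustration of those rules, not Paimon's code: the class name LoserTreeStateSketch, the Action enum, and the helper methods decide, promote and demote are hypothetical names introduced here. It covers which kind of comparison (if any) a child/parent state pair requires, and the new states in the two cases where the parent wins without a comparison.

```java
/** Hypothetical sketch of the comparison rules between a Local Winner and its parent node. */
final class LoserTreeStateSketch {
    enum State {
        WINNER_WITH_NEW_KEY, WINNER_WITH_SAME_KEY, WINNER_POPPED,
        LOSER_WITH_NEW_KEY, LOSER_WITH_SAME_KEY, LOSER_POPPED
    }

    /** What has to happen when a Local Winner (child) meets the loser stored in its parent. */
    enum Action { COMPARE_FULL_KEY, COMPARE_SEQUENCE_ONLY, CHILD_WINS, PARENT_WINS }

    static Action decide(State child, State parent) {
        switch (child) {
            case WINNER_WITH_NEW_KEY:
                switch (parent) {
                    case LOSER_WITH_NEW_KEY: return Action.COMPARE_FULL_KEY;
                    case LOSER_POPPED: return Action.PARENT_WINS;
                    default: throw new IllegalStateException("impossible pair: " + child + " / " + parent);
                }
            case WINNER_WITH_SAME_KEY:
                switch (parent) {
                    case LOSER_WITH_NEW_KEY: return Action.CHILD_WINS;
                    case LOSER_WITH_SAME_KEY: return Action.COMPARE_SEQUENCE_ONLY;
                    case LOSER_POPPED: return Action.CHILD_WINS;
                    default: throw new IllegalStateException("parent must be a LOSER_* state: " + parent);
                }
            case WINNER_POPPED:
                switch (parent) {
                    case LOSER_WITH_NEW_KEY: return Action.CHILD_WINS;
                    case LOSER_WITH_SAME_KEY: return Action.PARENT_WINS;
                    case LOSER_POPPED: return Action.CHILD_WINS;
                    default: throw new IllegalStateException("parent must be a LOSER_* state: " + parent);
                }
            default:
                throw new IllegalStateException("child must be a WINNER_* state: " + child);
        }
    }

    /** New state of a parent that wins without a comparison (the PARENT_WINS cases). */
    static State promote(State parent) {
        switch (parent) {
            case LOSER_POPPED: return State.WINNER_POPPED;
            case LOSER_WITH_SAME_KEY: return State.WINNER_WITH_SAME_KEY;
            default: throw new IllegalStateException("no comparison-free promotion for " + parent);
        }
    }

    /** New state of a child that loses without a comparison (the PARENT_WINS cases). */
    static State demote(State child) {
        switch (child) {
            case WINNER_WITH_NEW_KEY: return State.LOSER_WITH_NEW_KEY;
            case WINNER_POPPED: return State.LOSER_POPPED;
            default: throw new IllegalStateException("no comparison-free demotion for " + child);
        }
    }
}
```

Of the nine child/parent pairs, only one needs a full UserKey comparison and one needs a SequenceNumber comparison; the remaining pairs are resolved from the states alone, which is where the saving comes from.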

3.4 Optimization

The algorithm above yields a variant implementation of LoserTree. However, every time a record is taken from the head node, that node must be readjusted from bottom to top, regardless of whether the tree still contains unextracted nodes with the same UserKey. In the extreme case where the tree contains no duplicate UserKeys at all, two tree adjustments are needed after each global Winner is taken out: 1) one after setting the SequenceNumber to infinity; 2) one after advancing the RecordReader to its next record. In that case the performance of LoserTree may be worse than that of heap sort.
To address this, a FirstSameKeyIndex field is added to the leaf node. It records the position of the first node with the same UserKey that this node defeated, so that we can quickly tell whether the tree still contains an unprocessed node with the same UserKey. If it does, we can directly exchange the states of the two nodes and adjust upward from that position, which reduces the number of layers to adjust.

4. Algorithm Proof

In Paimon, each iteration of LoserTree merges all records with the same UserKey and only then advances the corresponding RecordReaders. Therefore, we only need to prove that all records with the same UserKey are returned within the round.
Theorem: When the FirstSameKeyIndex of the global Winner is -1, the tree contains no unprocessed node with the same UserKey as the global Winner.
Proof: By the definition of LoserTree, every subtree of a LoserTree is itself a LoserTree. Assume that the current global Winner comes from leaf node A, and that the tree contains a leaf node B that has the same UserKey as the global Winner but has not yet been processed. Let node C be the lowest common ancestor of A and B, with A and B in C's left and right subtrees, respectively.
Node A, being the global Winner, must take part in the comparison at node C. Since node B has the same UserKey as node A, and that UserKey is the minimum, node B either becomes the winner of the right subtree or is defeated there by another node with the same UserKey. In both cases, the final winner of the right subtree of node C has the same UserKey as node A, so node A defeats a node with the same UserKey at node C and its FirstSameKeyIndex cannot be -1. This contradicts the assumption, which proves that when the FirstSameKeyIndex of the global Winner is -1, the tree contains no unprocessed node with the same UserKey as the global Winner.

5. Performance benefits

Using the JMH framework, we benchmarked read performance with different numbers of RecordReaders and different data volumes, with UserKeys of type Integer and of type 128-bit String. Overall, LoserTree performs better than heap sort. The more complex the UserKey type, the higher the comparison cost and the more pronounced the optimization effect.
  • Test environment: Docker image apache/flink:1.16.1-java8, 4 CPU cores, 8 GB of memory.
  • Test results: When the UserKey is the simple type Integer, the improvement is about 10%; when the UserKey is a 128-bit String, performance improves by 30% to 50%.
[Figure: benchmark results, Integer-type UserKey]
[Figure: benchmark results, 128-bit String-type UserKey]

* References

  1. Apache Paimon official website: https://paimon.apache.org/
  2. Apache Paimon DingTalk communication group: 10880001919

* Author information

Li Ming, Bytedance Infrastructure Engineer, Apache Flink & Paimon Contributor
 
Origin my.oschina.net/u/5941630/blog/8903333