HBase: Skip List

1 Introduction

A skip list (SkipList) is an in-memory data structure that supports efficient insertion, deletion, and search; the expected complexity of each of these operations is O(logN). Compared with red-black trees and other binary search trees, a skip list is simpler to implement, and its locking granularity is smaller in concurrent scenarios, so it can achieve higher concurrency. Because of these advantages, skip lists are widely used in KV databases: Redis, LevelDB, and HBase all use a skip list as a basic data structure for maintaining ordered data sets.

As is well known, the query complexity of a linked list is O(N), where N is the number of elements in the list. Once the element to be deleted has been found, deleting it is very efficient: point the next pointer of its predecessor to its successor, which takes O(1). The problem is that finding the element is expensive, because a linked list must be scanned element by element. If the scan could skip over elements instead of visiting them one by one, the search complexity would drop. A skip list uses exactly this idea: it stores additional index information for some of the nodes in the linked list so that a search can skip ahead, which reduces the expected query complexity to O(logN). Because insertion and deletion both start with a search, their complexity improves as well.

1. Definition

As shown in the figure below, a skip list is defined as follows:

· A skip list consists of multiple layered linked lists (call them S0, S1, S2, ..., Sn); for example, there are 6 linked lists in the figure.

· The elements in each linked list are ordered.

· Each linked list contains two sentinel elements, -∞ (negative infinity) and +∞ (positive infinity), which represent the head and the tail of the linked list, respectively.

· Each upper linked list's element set is a subset of the element set of the linked list below it; that is, S1 is a subset of S0, S2 is a subset of S1, and so on.

· The height of the skip list is defined as the number of layers of horizontal linked lists.

[Figure: an example skip list with six layers]

Finding a specified element in a skip list is relatively simple. Taking the figure above as an example, a query for the value 5 starts from the top-left element (call it currentNode):

· If the value of currentNode's successor is less than or equal to the value being queried, move forward along the current linked list; otherwise, move down from the current node to the next lower linked list.

· Repeat until the value being queried is found (or currentNode becomes an empty node).
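
The two search rules above can be sketched in Java roughly as follows. This is a minimal illustration rather than the HBase implementation: the class name, the `Node` layout with `right` and `down` pointers, the helper methods, and the three-layer example data are all assumptions made for this sketch. `Integer.MIN_VALUE` and `Integer.MAX_VALUE` stand in for -∞ and +∞, so they cannot be used as real keys here.

```java
final class SkipListSearchSketch {
    /** One node in one layer: right = successor in the same layer, down = same key one layer lower. */
    static final class Node {
        final int value;
        Node right, down;
        Node(int value) { this.value = value; }
    }

    /** Builds one sorted layer with -inf and +inf sentinels and returns its head. */
    static Node buildLayer(int... sortedValues) {
        Node head = new Node(Integer.MIN_VALUE);
        Node cur = head;
        for (int v : sortedValues) {
            cur.right = new Node(v);
            cur = cur.right;
        }
        cur.right = new Node(Integer.MAX_VALUE);
        return head;
    }

    /** Links every node of the upper layer to the node with the same value in the lower layer. */
    static void linkDown(Node upper, Node lower) {
        Node l = lower;
        for (Node u = upper; u != null; u = u.right) {
            while (l.value != u.value) l = l.right; // upper is a subset of lower
            u.down = l;
        }
    }

    /** Search as described in the text: move right while the successor is <= target, else move down. */
    static boolean contains(Node topLeft, int target) {
        Node currentNode = topLeft;
        while (currentNode != null) {
            if (currentNode.value == target) return true;
            if (currentNode.right != null && currentNode.right.value <= target) {
                currentNode = currentNode.right;
            } else {
                currentNode = currentNode.down; // null once we fall off the bottom layer
            }
        }
        return false;
    }

    /** Hypothetical example data: S0 = {1,3,5,7,9}, S1 = {3,7}, S2 = {7}. Returns the top-left node. */
    static Node example() {
        Node s0 = buildLayer(1, 3, 5, 7, 9);
        Node s1 = buildLayer(3, 7);
        Node s2 = buildLayer(7);
        linkDown(s2, s1);
        linkDown(s1, s0);
        return s2;
    }
}
```

With this example data, a query for 5 goes down from the head of S2 to the head of S1, right to 3, down to 3 in S0, and right to 5, so it touches far fewer nodes than a scan of S0 would on a large list.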

2. Insert

The insertion algorithm for a skip list is a bit more complicated, as shown below. First, find the predecessor and successor of the element to be inserted using the search process described above; then generate a height value with the following random algorithm:

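// p is a constant in (0, 1); typically p = 1/4 or 1/2.
// Assumes the enclosing class has a java.util.Random field named random.
public int randomHeight(double p) {
    int height = 0;
    // Each additional layer is granted with probability p.
    while (random.nextDouble() < p) height++;
    return height + 1;
}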

Finally, create a vertical node whose number of layers is exactly the generated height value, and insert it into the corresponding linked lists of the skip list. Let height = randomHeight(p); there are two cases:

· If height is greater than the current height of the skip list, the skip list is raised to height, and the pointers of the head node and the tail node must be updated for the new layers.

· If height is less than or equal to the current height of the skip list, only the pointers of the predecessor and successor of the inserted element need to be updated in each layer the new node occupies.
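
The insertion steps above could look roughly like the following Java sketch. It is not taken from HBase or from the original article: it assumes the common array-of-forward-pointers layout instead of separate `down` pointers, uses `null` in place of the +∞ tail, and introduces a hypothetical `MAX_HEIGHT` cap and a seeded `Random` so that runs are repeatable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class SkipListInsertSketch {
    static final int MAX_HEIGHT = 16; // assumption: a cap on the height, not part of the original text
    static final double P = 0.5;

    static final class Node {
        final int value;
        final Node[] next; // next[i] is the successor in layer i; null plays the role of +inf
        Node(int value, int height) {
            this.value = value;
            this.next = new Node[height];
        }
    }

    private final Node head = new Node(Integer.MIN_VALUE, MAX_HEIGHT); // plays the role of -inf
    private final Random random;
    private int height = 1; // number of layers currently in use

    SkipListInsertSketch(long seed) { this.random = new Random(seed); }

    /** Same idea as randomHeight(p) in the text, capped at MAX_HEIGHT. */
    int randomHeight() {
        int h = 0;
        while (h < MAX_HEIGHT - 1 && random.nextDouble() < P) h++;
        return h + 1;
    }

    boolean contains(int target) {
        Node cur = head;
        for (int i = height - 1; i >= 0; i--) {
            while (cur.next[i] != null && cur.next[i].value < target) cur = cur.next[i];
        }
        Node candidate = cur.next[0];
        return candidate != null && candidate.value == target;
    }

    /** Inserts value; returns false if it is already present. */
    boolean insert(int value) {
        // Step 1: find the predecessor of the new element in every layer in use.
        Node[] pred = new Node[MAX_HEIGHT];
        Node cur = head;
        for (int i = height - 1; i >= 0; i--) {
            while (cur.next[i] != null && cur.next[i].value < value) cur = cur.next[i];
            pred[i] = cur;
        }
        Node succ = cur.next[0];
        if (succ != null && succ.value == value) return false;

        // Step 2: draw a random height; if it exceeds the current height, raise the list.
        int h = randomHeight();
        if (h > height) {
            for (int i = height; i < h; i++) pred[i] = head; // new layers start at the head
            height = h;
        }

        // Step 3: splice the new vertical node in between predecessor and successor in each layer.
        Node node = new Node(value, h);
        for (int i = 0; i < h; i++) {
            node.next[i] = pred[i].next[i];
            pred[i].next[i] = node;
        }
        return true;
    }

    /** Returns the bottom-layer values in order. */
    List<Integer> toList() {
        List<Integer> out = new ArrayList<>();
        for (Node n = head.next[0]; n != null; n = n.next[0]) out.add(n.value);
        return out;
    }
}
```

Whatever heights the random generator produces, the bottom layer stays sorted; only the shape of the upper index layers changes from run to run.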

[Figure: inserting a node into a skip list]

4. Delete

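Deletion is similar to insertion: find the predecessor of the node to be deleted in each layer that contains it, then point each predecessor at the node's successor in that layer. The details are not repeated here.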
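
Since the text leaves deletion out, here is one possible Java sketch of it, under the same assumptions as a typical array-based skip list: forward-pointer arrays, `null` as the +∞ tail, and a hypothetical `MAX_HEIGHT` cap. For brevity the search always starts at layer `MAX_HEIGHT - 1`, so empty upper layers simply report the head as predecessor and no separate height bookkeeping is needed. None of these names come from the original article or from HBase.

```java
import java.util.Random;

final class SkipListDeleteSketch {
    static final int MAX_HEIGHT = 16; // assumption: fixed cap on the number of layers
    static final double P = 0.5;

    static final class Node {
        final int value;
        final Node[] next; // next[i] is the successor in layer i; null plays the role of +inf
        Node(int value, int height) {
            this.value = value;
            this.next = new Node[height];
        }
    }

    private final Node head = new Node(Integer.MIN_VALUE, MAX_HEIGHT); // plays the role of -inf
    private final Random random = new Random(42); // fixed seed so runs are repeatable

    /** For every layer, returns the last node whose value is strictly less than the given value. */
    private Node[] findPredecessors(int value) {
        Node[] pred = new Node[MAX_HEIGHT];
        Node cur = head;
        for (int i = MAX_HEIGHT - 1; i >= 0; i--) {
            while (cur.next[i] != null && cur.next[i].value < value) cur = cur.next[i];
            pred[i] = cur;
        }
        return pred;
    }

    boolean contains(int value) {
        Node candidate = findPredecessors(value)[0].next[0];
        return candidate != null && candidate.value == value;
    }

    /** Inserts value; returns false if it is already present. */
    boolean insert(int value) {
        Node[] pred = findPredecessors(value);
        Node succ = pred[0].next[0];
        if (succ != null && succ.value == value) return false;
        int h = 1;
        while (h < MAX_HEIGHT && random.nextDouble() < P) h++;
        Node node = new Node(value, h);
        for (int i = 0; i < h; i++) {
            node.next[i] = pred[i].next[i];
            pred[i].next[i] = node;
        }
        return true;
    }

    /** Deletes value; returns false if it is not present. */
    boolean delete(int value) {
        Node[] pred = findPredecessors(value);
        Node target = pred[0].next[0];
        if (target == null || target.value != value) return false;
        // In every layer the target occupies, its predecessor points directly at it,
        // so unlinking is one pointer update per layer.
        for (int i = 0; i < target.next.length; i++) {
            pred[i].next[i] = target.next[i];
        }
        return true;
    }
}
```

As with insertion, the cost is one search plus a pointer update for each layer of the removed node.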

5. Complexity analysis

Here, we evaluate the time and space complexity of skip lists together.

Property 1 The probability that a node appears in the k-th layer is $p^{k-1}$.

This property is straightforward. A node appears in the k-th layer exactly when randomHeight(p) returns a height of at least k, which requires the first (k-1) random numbers to all be less than p. These are (k-1) independent events, each with probability p, so the probability is $p^{k-1}$.

Property 2 For a skip list with n elements in the bottom linked list, the expected total number of elements across all layers is:

$$n + np + np^2 + \cdots + np^{k-1} = n \cdot \frac{1 - p^k}{1 - p}$$

where k is the height of the skip list.

By Property 1, the probability that an element appears in the k-th layer is $p^{k-1}$, so the expected number of elements in the k-th layer is $n p^{k-1}$; summing over all k layers gives a total of $n(1 - p^k)/(1 - p)$. When p <= 1/2, this total is less than 2n, so the space complexity is O(n).
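
Property 2 can be checked numerically. The sketch below is only an illustration with made-up class and method names: it draws a random height for each of n elements using the same loop as randomHeight(p), adds the heights up, and compares the result with the sum n + np + ... + np^(k-1), which approaches n/(1-p) as k grows.

```java
import java.util.Random;

final class SkipListSpaceSketch {
    /** Same loop as randomHeight(p) in the text, with the Random passed in explicitly. */
    static int randomHeight(Random random, double p) {
        int height = 0;
        while (random.nextDouble() < p) height++;
        return height + 1;
    }

    /** Simulated total number of nodes over all layers after inserting n elements. */
    static long totalNodes(int n, double p, long seed) {
        Random random = new Random(seed);
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += randomHeight(random, p); // an element of height h occupies h layers
        }
        return total;
    }

    /** Closed form of n + n*p + ... + n*p^(k-1) for a skip list of height k. */
    static double expectedNodes(int n, double p, int k) {
        return n * (1 - Math.pow(p, k)) / (1 - p);
    }

    public static void main(String[] args) {
        int n = 100_000;
        double p = 0.5;
        System.out.println("simulated total nodes: " + totalNodes(n, p, 1L));
        System.out.println("limit n / (1 - p):     " + n / (1 - p));
    }
}
```

For p = 1/2 the simulated total should land close to 2n, and for p = 1/4 close to 4n/3, which is why a smaller p trades a little search speed for less memory.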

Property 3 The height of a skip list is O(logn).

Consider layer $1 + c\log_{1/p} n$ for a constant $c > 1$. By Property 1, the expected number of nodes that appear in this layer is

$$n \cdot p^{c\log_{1/p} n} = n^{1-c}$$

When n is large, this value approaches 0 (for example, it is $1/n^2$ for $c = 3$), so the number of layers is on the order of O(logn).

Property 4 The query time complexity of a skip list is O(logN).

The query time is determined by the total number of horizontal and vertical steps taken from the top-left corner down to the bottom. We analyze this path in reverse, that is, we count the expected number of steps (horizontal and vertical) taken from the bottom back up to the top-left corner. The node in column j of layer k can only be reached in one of the following two ways:

· From the node in column j of layer k-1, by going up. According to the definition of randomHeight(p), the probability of going up is p.

· From the node in column j+1 of layer k, by going left. In this case, the node in column j+1 of layer k is already the top of its vertical node, so the path cannot go up any further and can only go left. According to the definition of randomHeight(p), the probability of going left is (1-p).

Let $C_k$ be the expected number of steps (vertical and horizontal) needed to climb up k layers. Then:

$$C_0 = 0$$

$$C_k = (1-p)(1 + C_k) + p(1 + C_{k-1})$$

Solving for $C_k$ gives $C_k = 1/p + C_{k-1}$, and therefore $C_k = k/p$.

Since the height k is O(logN), the expected number of steps taken by a query is also O(logN).

Property 5 The time complexity of insertion and deletion in a skip list is O(logN).

As the implementation of insertion and deletion shows, the cost of each is dominated by the search for the position to modify, so its time complexity equals that of a query, and Property 5 holds.

Therefore, search, deletion, and insertion in a skip list all have an expected complexity of O(logN).

Origin blog.csdn.net/chuige2013/article/details/129544879