Data structures: a summary of the characteristics of the B tree, B-tree, B+ tree, and B* tree

Preface

Data is stored in a computer using four main storage structures: sequential, chained (linked), indexed, and hashed.

  • Sequential storage is the most familiar form (arrays, for example). It works well for small amounts of data, but a large data set may not fit in one contiguous block of memory, and inserting or deleting an element is slow because the elements after it must be moved.
  • Chained storage fixes those shortcomings, but lookups are slow because the chain must be walked node by node, and range queries are especially slow.
  • Hash storage makes lookups very fast: hashing finds the matching entry in close to O(1) time, but it does not support range queries well.
  • Indexed storage is the tree + chain structure discussed in this article. It uses the hierarchy of a tree to locate data quickly, and the chained leaf nodes support range queries well.

B tree (binary search tree)

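Before talking about B-trees, let’s look at binary trees. Note that this article writes "B tree" (no hyphen) for the binary search tree and "B-tree" (with a hyphen) for the multiway balanced tree; many other sources use both spellings for the multiway tree.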

  • Characteristics of Binary Tree
  1. At least one node (root node)

  2. Each node has at most two subtrees, that is, the degree of each node is less than 3.

  3. The left subtree and the right subtree are in order, and the order cannot be reversed arbitrarily.

  4. Even if a node in the tree has only one subtree, it is necessary to distinguish whether it is a left subtree or a right subtree.

The B tree (binary search tree) adds the following properties to the binary tree:

  • Every node stores one key;
  • The value of the root node is greater than the value of any node in its left subtree and less than the value of any node in its right subtree, and this rule applies to every node in the tree.
    The query cost of a B tree is between O(log n) and O(n). When the tree is well balanced, a lookup takes O(log n). In the worst case, for example when keys are inserted in sorted order, the tree degenerates into a chain and a lookup costs O(n), the same as searching a linked list.
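
The gap between O(log n) and O(n) can be seen with a minimal sketch of a binary search tree. The names Node, insert, search_steps, and build are illustrative and not from the original article; the sketch only counts how many nodes a lookup visits.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into the binary search tree and return its root."""
    if root is None:
        return Node(key)
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                cur.left = Node(key)
                break
            cur = cur.left
        elif key > cur.key:
            if cur.right is None:
                cur.right = Node(key)
                break
            cur = cur.right
        else:
            break  # duplicate keys are ignored
    return root


def search_steps(root, key):
    """Return the number of nodes visited while searching for key."""
    steps = 0
    cur = root
    while cur is not None:
        steps += 1
        if key == cur.key:
            break
        cur = cur.left if key < cur.key else cur.right
    return steps


def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root


# Sorted insertion order degenerates into a chain; a mixed order stays shallow.
chain_tree = build([1, 2, 3, 4, 5, 6, 7])
bushy_tree = build([4, 2, 6, 1, 3, 5, 7])
print(search_steps(chain_tree, 7), search_steps(bushy_tree, 7))
```

With the same seven keys, the lookup for 7 visits every node of the chain but only three nodes of the bushy tree.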

B trees used in practice add a balancing algorithm to the plain B tree, giving a "balanced binary search tree"; the AVL tree is the classic example.

Keeping the nodes evenly distributed is the key to a balanced binary tree; the balancing algorithm is a strategy for rotating nodes when nodes are inserted into or deleted from the B tree.

  • Features of AVL tree

It has the properties of a binary search tree (every node in the left subtree is smaller than the parent node, and every node in the right subtree is greater). In addition, the left and right subtrees of every node are themselves balanced binary trees: the heights of the two subtrees of any node differ by at most 1, that is, the balance factor is in the range [-1,1].
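
The two conditions can be checked directly. The following is a minimal sketch, assuming a tree is written as nested (left, key, right) tuples with None for an empty subtree; this representation and the function names are illustrative only. It recomputes heights at every node, which is simple but not efficient.

```python
def height(node):
    """Height of a subtree; an empty subtree (None) has height 0."""
    if node is None:
        return 0
    left, _, right = node
    return 1 + max(height(left), height(right))


def balance_factor(node):
    """Height of the left subtree minus height of the right subtree."""
    left, _, right = node
    return height(left) - height(right)


def is_avl(node, lo=float("-inf"), hi=float("inf")):
    """Check the search-tree ordering and the [-1, 1] balance factor at every node."""
    if node is None:
        return True
    left, key, right = node
    if not (lo < key < hi):
        return False
    if abs(balance_factor(node)) > 1:
        return False
    return is_avl(left, lo, key) and is_avl(right, key, hi)


balanced = ((None, 1, None), 2, (None, 3, None))
chain = (None, 1, (None, 2, (None, 3, None)))
print(is_avl(balanced), is_avl(chain), balance_factor(chain))
```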

  • Node insertion, rotation

Inserting a node into an AVL tree works as follows:

Insert the new node as in an ordinary binary search tree, then walk up from the new node toward the root, checking the balance factor of each node. If the balance factor of a node is outside the range [-1,1] (call this unbalanced node u), rebalance the subtree rooted at u by rotating.

Rotation method:

Left rotation: used for the RR case (the new node is in the right subtree of the right child of u); rotate the unbalanced node u to the left.

Right rotation: used for the LL case (the new node is in the left subtree of the left child of u); rotate the unbalanced node u to the right.

Left-right rotation: used for the LR case; first rotate ul, the left child of the unbalanced node u, to the left, then rotate u to the right.

Right-left rotation: used for the RL case; first rotate ur, the right child of the unbalanced node u, to the right, then rotate u to the left.
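
The four cases above can be sketched as a recursive insertion. This is a minimal illustration, not a reference implementation: the class AVLNode and the helper names are assumptions introduced here, and each node caches its own height so that the balance factor is cheap to compute.

```python
class AVLNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1


def h(node):
    return node.height if node else 0


def update(node):
    node.height = 1 + max(h(node.left), h(node.right))


def balance(node):
    return h(node.left) - h(node.right)


def rotate_left(u):
    r = u.right
    u.right = r.left
    r.left = u
    update(u)
    update(r)
    return r  # r is the new root of this subtree


def rotate_right(u):
    l = u.left
    u.left = l.right
    l.right = u
    update(u)
    update(l)
    return l  # l is the new root of this subtree


def insert(node, key):
    """Insert key and return the root of the rebalanced subtree."""
    if node is None:
        return AVLNode(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    else:
        return node  # duplicate keys are ignored
    update(node)
    bf = balance(node)
    if bf > 1:                          # left side too tall
        if balance(node.left) < 0:      # LR case: rotate the left child left first
            node.left = rotate_left(node.left)
        return rotate_right(node)       # LL case (or second step of LR)
    if bf < -1:                         # right side too tall
        if balance(node.right) > 0:     # RL case: rotate the right child right first
            node.right = rotate_right(node.right)
        return rotate_left(node)        # RR case (or second step of RL)
    return node


root = None
for k in range(1, 8):   # sorted input, the worst case for a plain binary search tree
    root = insert(root, k)
print(root.key, root.height)
```

Inserting the keys 1 to 7 in sorted order, which turns a plain binary search tree into a chain, here yields a tree of height 3 rooted at 4.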

  • Node deletion steps

Rebalancing after a deletion is much the same as after an insertion: walk up from the deleted node, checking the balance factor of each node, and if a node is out of balance, rebalance the subtree rooted at that node by rotating. Unlike insertion, a deletion may need rotations at more than one node on the path to the root.

Another well-known balanced tree is the red-black tree. Here is a brief comparison.

The AVL tree is more strictly balanced than the red-black tree. Because the heights of the two subtrees of any AVL node differ by at most 1, lookups are guaranteed to be O(log n). The red-black tree keeps its balance through node colors, so the color plays the role of the balance factor. It also guarantees O(log n) lookups, but it tolerates a height difference of 2 or more between the left and right subtrees without rotating, so it can be somewhat taller than an AVL tree holding the same keys.
As a result, the AVL tree may perform more rotations during insertion and deletion, which costs time.
In short, lookups are somewhat faster in an AVL tree because it is more balanced, but with many insertions and deletions the AVL tree rotates more often than the red-black tree and is slower.
Therefore, if your application performs frequent insertions and deletions, prefer the red-black tree (used, for example, by HashMap in Java 1.8). If insertions and deletions are rare and lookups are frequent, prefer the AVL tree.

B-tree

Most self-balancing search trees (such as AVL and red-black trees) assume that all data fits in main memory, but we also have to handle data sets too large for main memory. When the number of keys is large, the data is read from disk in blocks, and a disk access is very slow compared with a main-memory access.

The B-tree is a multiway search tree (not a binary one) whose main design goal is to reduce the number of disk accesses. Most tree operations (insert, delete, search, maximum, minimum, and so on) require O(h) disk accesses, where h is the height of the tree.
The B-tree keeps its height low by storing as many keys as possible in each node. The node size is usually equal to the disk block size or a multiple of it. Because the B-tree is so shallow, most operations need far fewer disk accesses than they would with a B tree (binary search tree) or a balanced binary search tree (AVL tree, red-black tree, and so on).

  • The main properties of a B-tree of order M (the value of M depends mainly on the node size and the size of the stored keys):

1. Every non-leaf node has at most M children, and M>2;

2. The root node, if it is not a leaf, has between 2 and M children;

3. Every non-leaf node other than the root has between ⌈M/2⌉ and M children;

4. Every node other than the root stores at least ⌈M/2⌉-1 and at most M-1 keys; the root stores at least 1 key;

5. In a non-leaf node, the number of keys = the number of child pointers - 1;

6. The keys of a non-leaf node are K[1], K[2], …, K[M-1], with K[i] < K[i+1];

7. The pointers of a non-leaf node are P[1], P[2], …, P[M], where P[1] points to the subtree whose keys are less than K[1], P[M] points to the subtree whose keys are greater than K[M-1], and every other P[i] points to the subtree whose keys lie in the open interval (K[i-1], K[i]);

8. All leaf nodes are on the same level;
A B-tree search starts at the root node and performs a binary search on the ordered key sequence inside the node. If the key is found, the search ends; otherwise the search descends into the child whose range contains the key. This repeats until the key is found or the search reaches a leaf node (whose child pointers are empty) without finding it.
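
The search procedure can be sketched as follows. This is a minimal in-memory illustration: BTreeNode and btree_search are hypothetical names, the tree is built by hand rather than by insertion, and the binary search inside a node uses bisect_left from the Python standard library.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted keys of this node
        self.children = children or []  # len(keys) + 1 children, empty for a leaf


def btree_search(node, key):
    """Return (node, index) where the key is stored, or None if it is absent."""
    while node is not None:
        i = bisect_left(node.keys, key)  # binary search inside the node
        if i < len(node.keys) and node.keys[i] == key:
            return node, i               # hit: may happen in a non-leaf node
        if not node.children:            # leaf reached without a hit
            return None
        node = node.children[i]          # child whose key range contains the key
    return None


# A hand-built two-level tree: keys below 20, between 20 and 40, and above 40.
root = BTreeNode(
    [20, 40],
    [BTreeNode([5, 10]), BTreeNode([25, 30]), BTreeNode([50, 60])],
)
print(btree_search(root, 30) is not None, btree_search(root, 35))
```

In a disk-based B-tree each step of the loop would correspond to one page read, which is why the height of the tree determines the number of disk accesses.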

Characteristics of B-tree:

1. The keyword set is distributed in the whole tree;

2. Every key appears in exactly one node;

3. The search may end at a non-leaf node;

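4. Its search performance is equivalent to a binary search over the complete set of keys;

5. The height of the tree is controlled automatically;

Because every non-leaf node other than the root has at least ⌈M/2⌉ children, node utilization is bounded from below and the height of the tree is O(log_{M/2} N). A binary search inside one node costs O(log_2 M) comparisons, so a search costs O(log_{M/2} N × log_2 M) = O(log_2 N) comparisons in the worst case, where M is the maximum number of subtrees of a non-leaf node and N is the total number of keys.

Therefore, the number of comparisons in a B-tree search is always equivalent to a binary search (it does not depend on the value of M), and the B-tree does not suffer from the balance problem of the B tree;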

Because of the M/2 lower bound, inserting into a node that is already full splits it into two nodes, each about half full (M/2); when deletions leave two sibling nodes each less than half full, they are merged into one node. In a disk-based index these operations are page splits and page merges.
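
Splitting and merging can be illustrated on the key list of a single node. The sketch below is an assumption-laden simplification: it handles only one node, ignores child pointers and the update of the parent, and the function names are introduced here for illustration.

```python
from bisect import insort


def split_node(keys):
    """Split an overfull sorted key list; the median key moves up to the parent."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]


def insert_key(keys, key, m):
    """Insert key into a node of a B-tree of order m (at most m - 1 keys).

    Returns (keys, None, None) if the node still fits,
    or (left_keys, median, right_keys) if it had to be split.
    """
    keys = list(keys)
    insort(keys, key)
    if len(keys) <= m - 1:
        return keys, None, None
    return split_node(keys)


def merge_nodes(left_keys, separator, right_keys):
    """Merge two underfull siblings, pulling their separator key down from the parent."""
    return left_keys + [separator] + right_keys


left, median, right = insert_key([10, 20, 30, 40], 25, m=5)
print(left, median, right)
print(merge_nodes([10], 20, [30, 40]))
```

With m = 5, the full node [10, 20, 30, 40] splits around the median 25 into two nodes of two keys each, which is exactly the minimum ⌈5/2⌉-1 = 2.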

In summary, the B-tree keeps the binary search tree's idea of partitioning the key range in advance and improves query efficiency as follows:

The binary tree becomes an m-way tree. The value of m can be tuned to the size of a page, so that one page stores more keys and each page read from disk brings in more data. The number of random I/O operations drops and efficiency improves greatly.
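
A rough calculation shows how much the fan-out m reduces the number of levels, and therefore the number of page reads per lookup. The sketch assumes the best case in which every node is completely full (m children and m - 1 keys); the function name min_levels and the sample values are illustrative.

```python
def min_levels(n_keys, fanout):
    """Fewest levels that can hold n_keys keys when every node is full
    (fanout children and fanout - 1 keys per node)."""
    levels = 0
    capacity = 0
    nodes_on_level = 1
    while capacity < n_keys:
        capacity += nodes_on_level * (fanout - 1)
        nodes_on_level *= fanout
        levels += 1
    return levels


print(min_levels(1_000_000, 2))    # binary tree
print(min_levels(1_000_000, 256))  # 256-way tree
```

For one million keys, a perfectly balanced binary tree needs 20 levels, while a 256-way tree needs 3.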

B+ tree

The B+ tree is a variant of the B-tree and is also a multiway search tree.

1. Its definition is basically the same as that of the B-tree, except for the points below;
2. A non-leaf node has as many subtree pointers as keys;
3. The subtree pointer P[i] of a non-leaf node points to the subtree whose keys lie in [K[i], K[i+1]) (the B-tree uses open intervals);
4. A chain pointer is added to every leaf node;
5. All keys appear in the leaf nodes;

Dynamic multi-level indexes are usually implemented with a B-tree or a B+ tree. The B-tree has one drawback, however: each node stores, next to every key, the data pointer for that key (a pointer to the disk block that holds the record). This greatly reduces the number of entries that fit in a B-tree node, which increases the number of levels in the tree and the time needed to find a record.

The B+ tree removes this drawback by storing data pointers only in the leaf nodes, so the structure of a leaf node differs from the structure of an internal node. Because data pointers exist only in the leaves, the leaf nodes must hold every key together with its data pointer to the disk block. The leaf nodes are also linked together to provide ordered access to the records. The leaf nodes therefore form the first level of the index, and the internal nodes form the upper levels of a multi-level index. Some leaf keys also appear in internal nodes, where they serve only to guide the search.

At the same number of levels, a B+ tree can store more keys in its internal nodes than a B-tree, so for the same number of keys the B+ tree has fewer levels, which noticeably shortens the search for any given key. The pointer P from each leaf to the next leaf also makes the B+ tree fast and effective at reading records from disk in order. For example, suppose an internal node of a B-tree and of a B+ tree each has a capacity of 100K. Because the B-tree node stores data pointers as well as keys, perhaps less than half of it (under 50K) is actually available for keys, whereas the B+ tree can devote the whole 100K to keys, so its index is naturally more efficient.
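
The effect on node capacity can be estimated with simple arithmetic. The page size and field sizes below are assumptions chosen for illustration (a 4096-byte page, 8-byte keys, 8-byte child pointers, 8-byte data pointers); real systems differ, and node headers are ignored.

```python
def btree_entries(page_size, key_size, child_ptr_size, data_ptr_size):
    """Entries per B-tree node: each entry carries a key, a data pointer, and a child pointer."""
    return page_size // (key_size + data_ptr_size + child_ptr_size)


def bplus_internal_entries(page_size, key_size, child_ptr_size):
    """Entries per B+ tree internal node: each entry carries only a key and a child pointer."""
    return page_size // (key_size + child_ptr_size)


# Assumed sizes in bytes; not taken from the article.
PAGE, KEY, CHILD_PTR, DATA_PTR = 4096, 8, 8, 8

print(btree_entries(PAGE, KEY, CHILD_PTR, DATA_PTR))   # 170
print(bplus_internal_entries(PAGE, KEY, CHILD_PTR))    # 256
```

Under these assumptions a B+ tree internal node holds about 50% more entries than a B-tree node of the same size, and the gap widens when the per-key data is larger than a single pointer.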

The difference between B-tree and B+ tree

1. In a B-tree, both non-leaf and leaf nodes store data, so a lookup takes O(1) in the best case (the key is in the root) and O(log n) in the worst case.
In a B+ tree, only leaf nodes store data; non-leaf nodes store only keys, so the same key may appear both in a non-leaf node and in a leaf. A lookup therefore always goes down to a leaf, and its time complexity is always O(log n).

2. The leaf nodes of a B+ tree are connected by a linked list, so a full traversal only needs to scan the linked list of leaves, whereas a B-tree can only be traversed with an in-order traversal.

The B+ tree adds the following optimizations to the B-tree:

1. Pointers are added between the leaf nodes, so the leaf nodes form a linked list;

2. Non-leaf nodes store only keys and no data; data is stored only in the leaf nodes;

Note: linking the leaves with a doubly linked list has an advantage over a singly linked list: from any leaf, the list can be traversed forward or backward to reach the other leaves.

The advantages are:

1. A range query can traverse the keys in order by following the linked list of leaf nodes, instead of backtracking through the tree with an in-order traversal.

2. Non-leaf nodes store only keys. On the one hand, each node divides the key space into more ranges, which speeds up the search. On the other hand, each index entry is smaller, so one page holds more keys, one page read brings in more keys and covers a larger search range, and the number of I/O reads and writes drops accordingly.
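
A range query over linked leaves can be sketched as follows. This is a minimal, hand-built example with hypothetical class and function names. It follows the B+ tree variant defined earlier in this article, in which an internal node has as many keys as children and child i covers the keys in [K[i], K[i+1]); the leaves are linked in one direction only.

```python
from bisect import bisect_left, bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys   # sorted keys (data pointers omitted)
        self.next = None   # chain pointer to the next leaf


class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # keys[i] is the smallest key under children[i]
        self.children = children  # same length as keys


def find_leaf(node, key):
    """Descend to the leaf whose range [K[i], K[i+1]) contains key."""
    while isinstance(node, Internal):
        i = max(bisect_right(node.keys, key) - 1, 0)
        node = node.children[i]
    return node


def range_query(root, lo, hi):
    """Return all keys k with lo <= k <= hi by walking the leaf chain."""
    out = []
    leaf = find_leaf(root, lo)
    while leaf is not None:
        for k in leaf.keys[bisect_left(leaf.keys, lo):]:
            if k > hi:
                return out
            out.append(k)
        leaf = leaf.next  # follow the chain instead of going back up the tree
    return out


leaves = [Leaf([1, 3, 5]), Leaf([7, 9, 11]), Leaf([13, 15, 17])]
for a, b in zip(leaves, leaves[1:]):
    a.next = b
root = Internal([1, 7, 13], leaves)
print(range_query(root, 5, 13))
```

Only one descent from the root is needed; after that the query touches leaf pages in order until it passes the upper bound.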

Why is B+ tree more suitable for database indexing than B-tree?

The B+ tree fits the characteristics of the disk better and needs fewer I/O reads and writes than the B-tree. An index file is large, so it is stored on disk. The non-leaf nodes of the B+ tree store only keys and no data, so a single page holds more keys. The more keys that are read into memory at a time, the fewer random disk reads a search needs.

The query efficiency of the B+ tree is more stable than that of the B-tree. Because the data exists only in the leaf nodes, every search goes down to a leaf and costs O(log n).

The leaf nodes of the B+ tree are connected in order by a linked list, so scanning all the data only requires one pass over the leaf nodes, which suits full scans and range queries. In the B-tree the non-leaf nodes also store data, so an ordered scan requires an in-order traversal of the whole tree. In other words, B+ trees are more efficient for range queries and ordered traversals.

B* tree

The B* tree is a variant of the B+ tree that adds sibling pointers to the non-root, non-leaf nodes of the B+ tree.

The B* tree requires every non-leaf node to hold at least (2/3)*M keys, so the minimum block utilization is 2/3 (instead of 1/2 in the B+ tree);

  • Splitting in the B+ tree: when a node is full, a new node is allocated, half of the data in the original node is moved to the new node, and a pointer to the new node is added to the parent node. A split in the B+ tree affects only the original node and the parent node, not the sibling nodes, so no sibling pointer is needed;

  • Splitting in the B* tree: when a node is full and its next sibling is not full, part of the data is moved to the sibling, the new key is inserted into the original node, and the sibling's key in the parent node is updated (because the sibling's key range has changed). If the sibling is also full, a new node is added between the original node and the sibling, 1/3 of the data from each of them is moved to the new node, and a pointer to the new node is added to the parent node;

When the B* tree needs to split a page, it first checks whether a sibling node has room, so it allocates new nodes less often than the B+ tree and its space utilization is higher;
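
The B* overflow rule can be sketched on two sibling key lists. This is a simplified illustration under stated assumptions: nodes are plain sorted lists, the new key belongs to the left node, the update of the parent is omitted, and keys are redistributed evenly, which is one possible choice for "moving part of the data". The function name bstar_insert is hypothetical.

```python
def bstar_insert(node, sibling, key, max_keys):
    """Insert key into `node`, whose next sibling is `sibling`.

    Returns the resulting nodes: two after a plain insert or a redistribution,
    three after a 2-to-3 split. Each node holds at most max_keys keys.
    """
    node = sorted(node + [key])
    if len(node) <= max_keys:
        return [node, sibling]            # no overflow
    merged = sorted(node + sibling)
    if len(sibling) < max_keys:
        # Sibling has room: share the keys between the two nodes, no new node.
        half = (len(merged) + 1) // 2
        return [merged[:half], merged[half:]]
    # Both nodes are full: turn two nodes into three, each about 2/3 full.
    n = len(merged)
    a = (n + 2) // 3
    b = (n - a + 1) // 2
    return [merged[:a], merged[a:a + b], merged[a + b:]]


print(bstar_insert([1, 2, 3, 4, 5, 6], [10, 11, 12], 7, max_keys=6))
print(bstar_insert([1, 2, 3, 4, 5, 6], [10, 11, 12, 13, 14, 15], 7, max_keys=6))
```

In the first call the sibling absorbs the overflow and no node is allocated; in the second call both nodes are full, so a third node is created and every node ends up holding at least 4 of its 6 slots.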

Origin blog.csdn.net/Shangxingya/article/details/114916278