Analysis of common database indexes (B tree, B-tree, B+ tree, B* tree, bitmap index, hash index)

B tree

       In this article, "B tree" means a binary search tree:

       1. All non-leaf nodes have at most two children (Left and Right);

       2. Every node stores one keyword;

       3. The left pointer of a non-leaf node points to a subtree whose keywords are all smaller than its own keyword, and the right pointer points to a subtree whose keywords are all larger;


       The search of a B tree starts from the root node. If the query keyword equals the node's keyword, it is a hit; if the query keyword is smaller, the search enters the left child; if it is larger, the search enters the right child. If the corresponding left or right pointer is empty, the search reports that the keyword cannot be found;

       If the left and right subtrees of every non-leaf node contain a similar number of nodes (the tree is balanced), the search performance of the B tree approaches that of binary search. Its advantage over binary search in a contiguous memory region is that changing the structure of the B tree (inserting and deleting nodes) does not require moving large blocks of memory, and often costs only constant overhead;


   But after many insertions and deletions, a B tree may end up with a very different structure:

   [Figure: on the left, a balanced tree; on the right, a tree built from the same keywords that has degenerated into a chain]

   The tree on the right is also a B tree, but its search performance is already linear. The same set of keywords can produce different tree structures, so when using a B tree one should keep it as close as possible to the structure on the left and avoid the structure on the right. This is the so-called "balance" problem;

       The B trees used in practice add a balancing algorithm to the basic B tree, giving a "balanced binary tree". How to keep the nodes of the tree evenly distributed is the key to a balanced binary tree; the balancing algorithm is a strategy applied when inserting and deleting nodes. Common balanced binary trees include the AVL tree, the red-black tree (RBT), the Treap, and the Splay tree.
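The search rule and the balance problem described above can be sketched in a few lines of Python. This is a minimal illustration, not a production structure: the names `Node`, `insert`, `search`, `height` and `build` are chosen here for the example, and no balancing algorithm is applied, so the shape of the tree depends entirely on insertion order.

```python
import random


class Node:
    """One binary-search-tree node holding a single keyword."""

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key without any rebalancing and return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key == node.key:
            return root  # keyword already present
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key))
            return root
        node = child


def search(root, key):
    """Follow left/right pointers; an empty pointer means 'not found'."""
    node = root
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False


def height(root):
    """Number of levels in the tree, computed level by level."""
    levels = 0
    frontier = [root] if root is not None else []
    while frontier:
        levels += 1
        frontier = [c for n in frontier for c in (n.left, n.right) if c is not None]
    return levels


def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root


keys = list(range(1, 128))
chain = build(keys)                 # sorted insertion order
shuffled = keys[:]
random.Random(0).shuffle(shuffled)
bushy = build(shuffled)             # same keywords, shuffled insertion order

print(height(chain))   # 127: every node has only a right child
print(height(bushy))   # far fewer levels; the exact value depends on the shuffle
```

Both trees hold the same 127 keywords, yet a search in `chain` may visit 127 nodes, which is the linear behavior that a balancing algorithm is meant to prevent.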

B-tree

A B-tree is a multiway search tree (not a binary tree):

       1. Any non-leaf node has at most M children, where M > 2;

       2. The root node has between 2 and M children;

       3. Every non-leaf node other than the root has between M/2 (rounded up) and M children;

       4. Each node stores at least M/2-1 (with M/2 rounded up) and at most M-1 keywords; the root may hold as few as one keyword;

       5. The number of keywords in a non-leaf node = the number of its child pointers - 1;

       6. Keywords of a non-leaf node: K[1], K[2], …, K[M-1], with K[i] < K[i+1];

       7. Pointers of a non-leaf node: P[1], P[2], …, P[M], where P[1] points to the subtree whose keywords are less than K[1], P[M] points to the subtree whose keywords are greater than K[M-1], and every other P[i] points to the subtree whose keywords lie in (K[i-1], K[i]);

       8. All leaf nodes are located on the same level;

       [Figure: an example B-tree with M=3]

  The search of a B-tree starts from the root node and performs a binary search on the (ordered) keyword sequence within the node. If it hits, the search ends; otherwise it enters the child node covering the range that the query keyword belongs to. This repeats until the corresponding child pointer is empty or the node is already a leaf;
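The search procedure just described might look like the following sketch. The `BTreeNode` class, the `btree_search` function and the hand-built three-way tree are assumptions made for illustration; only the standard-library `bisect_left` is used for the binary search inside a node.

```python
from bisect import bisect_left


class BTreeNode:
    """keys is sorted; children is empty for a leaf, else len(keys) + 1 long."""

    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []


def btree_search(node, key):
    """Return the node that contains key, or None if key is absent."""
    while node is not None:
        i = bisect_left(node.keys, key)       # binary search inside the node
        if i < len(node.keys) and node.keys[i] == key:
            return node                        # hit, possibly at a non-leaf node
        if not node.children:
            return None                        # reached a leaf without a hit
        node = node.children[i]                # child covering the keyword's range
    return None


# Hand-built example with M = 3: at most 3 children and 2 keywords per node.
root = BTreeNode(
    [20, 40],
    [BTreeNode([5, 10]), BTreeNode([30]), BTreeNode([50, 60])],
)

print(btree_search(root, 40) is root)     # True: the hit happens at the root
print(btree_search(root, 30).keys)        # [30]
print(btree_search(root, 35))             # None
```

Note that the lookup for 40 ends at the root, which is the "search may end at a non-leaf node" property of the B-tree.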

Features of B-trees:

       1. The keyword set is distributed over the whole tree;

       2. Every keyword appears in exactly one node;

       3. A search may end at a non-leaf node;

       4. Its search performance is equivalent to a binary search over the complete set of keywords;

       5. The tree controls its own height automatically;

       Because every non-leaf node other than the root must have at least M/2 children, a minimum node utilization is guaranteed. This bounds the worst-case search cost in terms of M, the configured maximum number of subtrees of a non-leaf node, and N, the total number of keywords;

       Therefore, the performance of a B-tree is always equivalent to binary search (independent of the value of M), and the balance problem of the B tree (binary search tree) does not arise;
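A sketch of why the bound does not depend on M, under the rules listed earlier. The symbols t (the minimum number of children of a non-root internal node, M/2 rounded up) and h (the number of levels) are introduced here for the derivation and are not part of the original text; the in-node binary search is assumed to cost at most about log2 M comparisons.

```latex
\begin{aligned}
N &\ge 1 + (t-1)\sum_{i=0}^{h-2} 2\,t^{i} \;=\; 2\,t^{\,h-1} - 1
  \quad\Longrightarrow\quad h \le 1 + \log_t \frac{N+1}{2}, \\[4pt]
\text{cost} &\le h \cdot \lceil \log_2 M \rceil
  \;=\; O\!\left(\frac{\log_2 M}{\log_2 t}\,\log_2 N\right)
  \;=\; O(\log_2 N), \qquad t = \lceil M/2 \rceil .
\end{aligned}
```

The first line counts the fewest keywords a tree with h levels can hold: one in the root, then at least t-1 in each of the 2, 2t, 2t^2, … nodes on the levels below. The ratio log2 M / log2 t is at most 2 for every M > 2, so it is absorbed into the constant.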

       Because of the M/2 limit, when a keyword is inserted into a node that is already full, the node must be split into two nodes that are each about half full (M/2); when a keyword is deleted, two sibling nodes that each hold fewer than M/2 must be merged;

B+ tree

  A B+ tree is a variant of the B-tree and is also a multiway search tree. Its definition is basically the same as that of the B-tree, except that:

       1. The number of subtree pointers of a non-leaf node equals the number of its keywords;

       2. The subtree pointer P[i] of a non-leaf node points to the subtree whose keywords lie in [K[i], K[i+1]) (the B-tree uses an open interval);

       3. All leaf nodes are linked together by a chain pointer;

       4. All keywords appear in the leaf nodes;

       [Figure: an example B+ tree with M=3]

The B+ tree search is basically the same as the B-tree search. The difference is that a B+ tree search hits only when it reaches a leaf node (a B-tree search can hit at a non-leaf node); its performance is likewise equivalent to a binary search over the complete set of keywords;

       Features of the B+ tree:

       1. All keywords appear in the linked list of leaf nodes (a dense index), and the keywords in the linked list are in sorted order;

       2. A search can never hit at a non-leaf node;

       3. The non-leaf nodes act as an index (a sparse index) over the leaf nodes, and the leaf nodes act as the data layer that stores the (keyword) data;

       4. It is better suited to file index systems;
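The leaf chain is what makes range queries cheap: descend once to the first matching leaf, then follow the chain pointers. The sketch below assumes the convention used in this section (a non-leaf node has as many keywords as pointers, and P[i] covers [K[i], K[i+1])); the classes `Leaf` and `Inner` and the functions `find_leaf` and `range_scan` are hypothetical names, and the tree is built by hand rather than by insertion.

```python
from bisect import bisect_left, bisect_right


class Leaf:
    """Leaf of a B+ tree: sorted keywords plus a pointer to the next leaf."""

    def __init__(self, keys):
        self.keys = keys
        self.next = None


class Inner:
    """Non-leaf node: keys[i] is the smallest keyword under children[i]."""

    def __init__(self, keys, children):
        self.keys = keys
        self.children = children


def find_leaf(node, key):
    """Descend to the leaf whose range covers key (never stops early)."""
    while isinstance(node, Inner):
        i = max(bisect_right(node.keys, key) - 1, 0)
        node = node.children[i]
    return node


def range_scan(root, low, high):
    """Return every keyword k with low <= k <= high via the leaf chain."""
    leaf = find_leaf(root, low)
    out = []
    while leaf is not None:
        for k in leaf.keys[bisect_left(leaf.keys, low):]:
            if k > high:
                return out
            out.append(k)
        leaf = leaf.next
    return out


leaves = [Leaf([5, 8]), Leaf([10, 15]), Leaf([20, 26]), Leaf([28, 35])]
for a, b in zip(leaves, leaves[1:]):
    a.next = b                      # chain pointer between neighbouring leaves
root = Inner(
    [5, 20],
    [Inner([5, 10], leaves[:2]), Inner([20, 28], leaves[2:])],
)

print(range_scan(root, 9, 27))      # [10, 15, 20, 26]
```

Only one root-to-leaf descent is performed; the rest of the scan touches leaf nodes alone, which is why the non-leaf levels can be treated as a sparse index over the data layer.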

B* tree

The B* tree is a variant of the B+ tree that adds pointers to siblings in the non-root, non-leaf nodes of the B+ tree;

The B* tree requires that the number of keywords in a non-leaf node is at least (2/3)*M, that is, the minimum utilization of a block is 2/3 (instead of the 1/2 of the B+ tree);

       Splitting in a B+ tree: when a node is full, a new node is allocated, 1/2 of the data in the original node is copied to the new node, and finally a pointer to the new node is added to the parent node. A B+ tree split affects only the original node and the parent node, not the sibling node, so no pointer to the sibling is needed;

       Splitting in a B* tree: when a node is full and its next sibling is not full, part of the data is moved to the sibling, the keyword is then inserted into the original node, and finally the sibling's keyword in the parent node is updated (because the sibling's keyword range has changed). If the sibling is also full, a new node is added between the original node and the sibling, 1/3 of the data from each of them is copied to the new node, and finally a pointer to the new node is added to the parent node;

       Therefore, a B* tree allocates new nodes less often than a B+ tree, and its space utilization is higher;
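The two split strategies can be compared on plain sorted lists of keywords. This is a simplified sketch under stated assumptions: nodes are modeled as Python lists, parent updates are omitted, and the even redistribution used in `bstar_insert_full` is one possible policy chosen for illustration. The function names and the `capacity` parameter are hypothetical.

```python
def bplus_split(node):
    """B+ style: a full node is cut in half; the sibling is not touched."""
    mid = len(node) // 2
    return node[:mid], node[mid:]


def bstar_insert_full(node, sibling, key, capacity):
    """B* style insert of key into a full node with a right-hand sibling.

    If the sibling has room, keywords are shifted into it and no node is
    allocated. If both are full, the two nodes become three nodes that are
    each about 2/3 full.
    """
    merged = sorted(node + [key] + sibling)
    if len(sibling) < capacity:
        cut = (len(merged) + 1) // 2
        return [merged[:cut], merged[cut:]]            # no new node
    third, extra = divmod(len(merged), 3)
    sizes = [third + (1 if i < extra else 0) for i in range(3)]
    a, b = sizes[0], sizes[0] + sizes[1]
    return [merged[:a], merged[a:b], merged[b:]]       # one new node


full = [1, 2, 3, 4, 5, 6]                              # capacity 6, node is full

print(bplus_split(full))
# ([1, 2, 3], [4, 5, 6])

print(bstar_insert_full(full, [10, 11, 12], 7, capacity=6))
# [[1, 2, 3, 4, 5], [6, 7, 10, 11, 12]]

print(bstar_insert_full(full, [10, 11, 12, 13, 14, 15], 7, capacity=6))
# [[1, 2, 3, 4, 5], [6, 7, 10, 11], [12, 13, 14, 15]]
```

In the last call every resulting node holds at least 4 of 6 slots, matching the 2/3 minimum utilization, whereas the B+ split leaves both halves at 1/2.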

Summary

       B tree: a binary search tree. Each node stores only one keyword; an equal keyword is a hit, a smaller one goes to the left child, and a larger one goes to the right child;

       B-tree: a multiway search tree. Each node stores M/2-1 to M-1 keywords (M/2 to M children), and the child pointers of a non-leaf node point to the subtrees covering the ranges between its keywords;

       All keywords are distributed over the entire tree, each appears only once, and a search can hit at a non-leaf node;

       B+ tree: on the basis of the B-tree, a linked-list pointer is added to the leaf nodes, all keywords appear in the leaf nodes, and the non-leaf nodes serve as an index over the leaf nodes; a B+ tree search always hits at a leaf node;

       B* tree: on the basis of the B+ tree, linked-list pointers are also added to the non-leaf nodes, and the minimum utilization of a node is raised from 1/2 to 2/3.

Bitmap index

1. Case

Consider a table named table with three columns: name, gender, and marital status. Gender has only two values, male and female; marital status has three values: married, unmarried, and divorced. The table holds 1,000,000 records in total. Now there is a query like this: select * from table where Gender='male' and Marital='unmarried';

Name      | Gender | Marital status (Marital)
Zhang San | male   | married
Li Si     | female | married
Wang Wu   | male   | unmarried
Zhao Liu  | female | divorced
Sun Qi    | female | unmarried
...       | ...    | ...

1) Do not use an index

  When no index is used, the database can only scan all records row by row, and then determine whether the record meets the query conditions.

2) B-tree index

  For gender, the only possible values are 'male' and 'female', and each may account for 50% of the data in the table. In that case a B-tree index would still have to fetch half of the data, so adding one is pointless. Conversely, if a field has a very wide range of values with almost no repetition, such as an ID number, a B-tree index is more appropriate. In fact, when the retrieved rows make up most of the data in the table, databases such as Oracle and MySQL will not use the B-tree index even if one exists, and will most likely scan all rows instead.

2. Bitmap indexing

If the cardinality of the queried column is very small, that is, the column has only a few fixed values (gender, marital status, administrative region, and so on), a bitmap index is the appropriate way to index it.

For the gender column, the bitmap index forms two vectors. The male vector is 10100... and the female vector is 01011...; each bit of a vector indicates whether the corresponding row has that value.

RowId  | 1 | 2 | 3 | 4 | 5 | ...
male   | 1 | 0 | 1 | 0 | 0 | ...
female | 0 | 1 | 0 | 1 | 1 | ...

  For the marital status column, the bitmap index produces three vectors: 11000... for married, 00101... for unmarried, and 00010... for divorced.

RowId     | 1 | 2 | 3 | 4 | 5 | ...
married   | 1 | 1 | 0 | 0 | 0 | ...
unmarried | 0 | 0 | 1 | 0 | 1 | ...
divorced  | 0 | 0 | 0 | 1 | 0 | ...

   When we run the query "select * from table where Gender='male' and Marital='unmarried';", we first take the male vector 10100..., then the unmarried vector 00101..., and perform an AND operation on the two. This yields a new vector 00100..., in which the third bit is 1, indicating that the third row of the table is the result we are looking for.

RowId     | 1 | 2 | 3 | 4 | 5
male      | 1 | 0 | 1 | 0 | 0
AND       |   |   |   |   |
unmarried | 0 | 0 | 1 | 0 | 1
result    | 0 | 0 | 1 | 0 | 0
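The AND step can be reproduced with Python integers used as bit vectors. This is a minimal sketch over the five example rows only; the helper names `build_bitmaps` and `as_vector` are made up for the illustration and do not correspond to any database API.

```python
rows = [
    ("Zhang San", "male", "married"),
    ("Li Si", "female", "married"),
    ("Wang Wu", "male", "unmarried"),
    ("Zhao Liu", "female", "divorced"),
    ("Sun Qi", "female", "unmarried"),
]


def build_bitmaps(rows, column):
    """Map each distinct value of a column to an int used as a bit vector.

    Bit i (counted from the least significant bit) is 1 when row i holds
    that value.
    """
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[column]] = bitmaps.get(row[column], 0) | (1 << i)
    return bitmaps


def as_vector(bits, n):
    """Render a bitmap left to right in row order, as in the tables above."""
    return "".join("1" if (bits >> i) & 1 else "0" for i in range(n))


gender = build_bitmaps(rows, 1)
marital = build_bitmaps(rows, 2)

hit = gender["male"] & marital["unmarried"]       # a single bitwise AND
print(as_vector(gender["male"], 5))               # 10100
print(as_vector(marital["unmarried"], 5))         # 00101
print(as_vector(hit, 5))                          # 00100
print([rows[i][0] for i in range(5) if (hit >> i) & 1])   # ['Wang Wu']
```

The query condition is answered without touching the rows themselves until the final step, where only the positions whose bit is 1 are fetched.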

3. Scenarios suited to bitmap indexes

As mentioned above, bitmap indexes suit columns with only a few fixed values, such as gender, marital status, or administrative region; a column like an ID number is not suitable for a bitmap index.

  In addition, bitmap indexes suit static data and are not suitable for columns that are updated frequently. For example, suppose there is a field busy that records how busy each machine is: busy is 1 when the machine is busy and 0 when it is not.

  Some people would suggest a bitmap index here, because busy has only two values. Suppose we do index the busy field with a bitmap index. User A runs an update on the busy value of one machine, for example update table set table.busy=1 where rowid=100;, but has not yet committed. User B then runs an update on the busy value of another machine: update table set table.busy=1 where rowid=12;. At this point user B's update cannot proceed and must wait for user A to commit.

  Reason: when user A updates the busy value of a machine to 1, the bitmap vector for all machines whose busy value is 1 changes. The database therefore locks all rows with busy = 1 and releases the lock only after the commit.

Hash index

The value of the indexed column is hashed, and the entry is stored in the matching hash bucket of a hash table. The bucket holds the pointer to the actual data row, which is then used to look up the row.

In summary, to find a row of data or process a where clause, the SQL Server engine needs to do the following:

1. Apply the hash function to the parameters in the where condition;

2. Match the index column to the corresponding hash bucket; finding the bucket means that the corresponding data row pointer (row pointer) has also been found;

3. Read the data.

A hash index is simpler than a B-tree index: because it does not need to traverse a B-tree, access is faster.

Disadvantages of Hash Index:

1. Because a hash index compares the values produced by the hash function, it can only be used for equality comparisons and cannot be used for range queries.

2. The order of the hash values does not necessarily match the order of the original key values, so a hash index cannot be used to speed up any sorting operation.

3. A search cannot use only part of the index key, because for a composite index the hash value is computed over all of the columns together.

4. When many hash values collide and the amount of data is very large, retrieval is less efficient than with a B-tree index.
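The lookup steps and the first limitation can be illustrated with a toy in-memory index. This sketch is not how SQL Server or any other engine implements hash indexes; the class `HashIndex`, its methods, and the use of Python's built-in `hash` are assumptions made for the example, and row pointers are modeled as list positions.

```python
class HashIndex:
    """Toy hash index: each bucket holds (key, row_pointer) pairs."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # Step 1 and 2: hash the key and pick the matching bucket.
        return self.buckets[hash(key) % len(self.buckets)]

    def add(self, key, row_pointer):
        self._bucket(key).append((key, row_pointer))

    def lookup(self, key):
        """Equality search: one hash, then comparisons inside one bucket."""
        return [ptr for k, ptr in self._bucket(key) if k == key]

    def range_lookup(self, low, high):
        """Buckets carry no key order, so every entry has to be examined."""
        return sorted(
            ptr
            for bucket in self.buckets
            for k, ptr in bucket
            if low <= k <= high
        )


table = ["row-a", "row-b", "row-c", "row-d"]      # stand-in for data rows
index = HashIndex()
for row_pointer, key in enumerate([42, 7, 19, 7]):
    index.add(key, row_pointer)

print(index.lookup(7))                            # [1, 3]
print([table[p] for p in index.lookup(42)])       # ['row-a'] (step 3: read data)
print(index.range_lookup(5, 20))                  # [1, 2, 3]
```

`lookup` inspects a single bucket, while `range_lookup` degenerates into a scan of the whole index; duplicate keys such as 7 land in the same bucket, which hints at why heavy repetition slows retrieval.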

 
