Why do database indexes use a B+ tree instead of some other data structure?

This is an interesting question. In my previous article, I covered:
1. Why a database index can't use a binary search tree;
2. Why a database index can't use a red-black tree.

This article adds:
1. Why a hash table can't be used;
2. Why a B-tree can't be used;
3. Why a B+ tree can be used.

1. Why can't a binary search tree be used for database indexes?

Lookups in a binary search tree are usually quite fast.
So why is it unsuitable as the underlying structure of a database index? This is a common interview question.

We can demonstrate:
https://www.cs.usfca.edu/~galles/visualization/BST.html

Suppose we add an index to Col1 and insert the following values into the binary search tree in order: 1, 2, 3, 4, 5, 6, 7.

image.png

You can see that it degenerates into a linked list.

When we query 7, the search has to pass through all 7 nodes, so the time complexity is O(n), the same as a singly linked list.

Inserting from largest to smallest has the same effect:
image.png

To summarize:

  • If a binary search tree is used as the underlying structure of a database index, it degenerates into a singly linked list when the data arrives in sorted order, so it is not suitable.

Imagine building a binary search tree index on an auto-increment column. Every new value is larger than all the previous ones, so this extreme case happens all the time: every node ends up on one side.
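The degeneration can be reproduced with a few lines of Python. This is only a minimal sketch of a plain, unbalanced binary search tree; the names `Node`, `insert` and `comparisons_to_find` are made up for illustration and are not taken from any database.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert a key into an unbalanced binary search tree and return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right


def comparisons_to_find(root, key):
    """Count how many nodes are visited before the key is found."""
    count = 0
    node = root
    while node is not None:
        count += 1
        if key == node.key:
            return count
        node = node.left if key < node.key else node.right
    return count


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


sorted_tree = build([1, 2, 3, 4, 5, 6, 7])   # auto-increment style input
mixed_tree = build([4, 2, 6, 1, 3, 5, 7])    # same keys, friendlier order

print(comparisons_to_find(sorted_tree, 7))   # visits every node
print(comparisons_to_find(mixed_tree, 7))    # visits one node per level
```

With sorted input every node has only a right child, so finding 7 touches all seven nodes. The same keys inserted in a mixed order give a tree of three levels.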

2. Why is red-black tree not suitable for database indexing?

A red-black tree is a self-balancing binary search tree.

Red-black trees should be familiar to Java developers: since JDK 8, HashMap converts a bucket with many colliding entries into a red-black tree.

If red-black trees are good enough for the JDK, why are they unsuitable as the index data structure in a database?

Keeping the same assumption as above, suppose we build a red-black tree index on Col1.

The process is dynamically demonstrated as follows:
Kapture 2021-03-18 at 09.24.45.gif

If we execute:

select * from table1 where Col1 = 7;

The dynamic demonstration is as follows:

As you can see, 7 is found after a total of 4 comparisons. That is a clear improvement over having no index, since at least it is not a full scan.

find-red-black-7.gif

To sum up:

Watching the demonstration, you can see that almost every insertion triggers an adjustment (recoloring or rotation) to keep the height of the tree balanced.
With a very large amount of data these adjustments become time-consuming, and because each node holds only one key the tree also grows tall, so the red-black tree is not suitable.
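To give an idea of what such an adjustment looks like, here is a minimal sketch of a single left rotation, one of the operations a red-black tree uses to rebalance. It is not a full red-black insertion (there is no coloring logic), and `Node` and `rotate_left` are illustrative names.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def rotate_left(x):
    """Rotate the subtree rooted at x to the left and return the new root."""
    y = x.right          # y moves up and becomes the new subtree root
    x.right = y.left     # y's left subtree is re-attached under x
    y.left = x           # x moves down and becomes y's left child
    return y


# A right-leaning chain 1 -> 2 -> 3, as produced by inserting 1, 2, 3 in order.
chain = Node(1, right=Node(2, right=Node(3)))

root = rotate_left(chain)
print(root.key, root.left.key, root.right.key)
```

After the rotation, 2 is the root with 1 and 3 as its children. Each rotation only rewires a few pointers, but a real tree has to do this kind of bookkeeping again and again as rows are inserted.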

3. Why can't a hash table be used as the index data structure?

If you are reading this article, you are probably already familiar with hash tables.

Put simply, a hash table has the following characteristics:

  • 1. The position of each inserted item is determined by its hash value, so the stored data is unordered;
  • 2. Insertion is fast;
  • 3. Lookup is also fast.

Let's try to insert a set of data into the hash table:

100, 13, 101, 14, 103, 109

We use https://www.cs.usfca.edu/~galles/visualization/ClosedHash.html to simulate the hash table:

image.png
In order to show why the Hash table is not applicable to the database, we insert the prepared data sequentially:

The dynamic demonstration is as follows:
Kapture 2021-03-20 at 11.39.07.gif

The results are as follows:
image.png

(1) Range queries

We often use SQL to query a range of data in the database. For example:

select * from t where id < 15;

Hash tables are unordered, so the keys smaller than 15 are not stored next to each other and a range query becomes difficult.

You can probably see it already: to answer this query, a hash table would have to examine every entry, so it is not a good fit.
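The problem can be sketched in Python. The table below is an assumption made for illustration: 29 slots, `key % 29` as the hash function and linear probing, chosen because with these settings 100 and 13 land on the same slot. The keys are my reading of the data set inserted earlier.

```python
TABLE_SIZE = 29  # assumed size; 100 % 29 and 13 % 29 are both 13


def build_table(keys):
    """Insert keys into an open-addressing table using linear probing."""
    table = [None] * TABLE_SIZE
    for key in keys:
        slot = key % TABLE_SIZE
        while table[slot] is not None:
            slot = (slot + 1) % TABLE_SIZE
        table[slot] = key
    return table


def range_less_than(table, upper):
    """Answer `id < upper` by examining every slot, since slots are unordered."""
    slots_examined = 0
    matches = []
    for key in table:
        slots_examined += 1
        if key is not None and key < upper:
            matches.append(key)
    return sorted(matches), slots_examined


table = build_table([100, 13, 101, 14, 103, 109])
matches, slots_examined = range_less_than(table, 15)

print(matches)          # the keys smaller than 15
print(slots_examined)   # every slot had to be checked
```

Only two keys match, yet all 29 slots are examined, because nothing about a key's slot says whether its neighbours are smaller or larger.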

(2) Hash collisions

As you can see from the dynamic demonstration of inserting data, the hash values of 100 and 13 are both 13.

When the slot is already taken, the new key moves on to the next free slot (this is one of the ways a hash table resolves collisions).

Kapture 2021-03-20 at 11.45.25.gif

For example: we first insert 100, then insert 13.

If we then look for 13, the lookup is slower, because it first hits the slot occupied by 100 and has to probe further.

With two numbers the cost is hardly noticeable, but what about 10,000, 100,000 or 1 million? When many keys collide, a lookup can approach a full table scan in the worst case.

Therefore, the hash table is generally inappropriate.
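The extra work caused by collisions can be counted. The following sketch assumes the same setup as the demonstration appears to use (a 29-slot table, `key % 29`, linear probing); these parameters and the function names are assumptions for illustration.

```python
TABLE_SIZE = 29  # assumed size, so that 100 and 13 both hash to slot 13


def insert(table, key):
    """Place the key in the first free slot at or after its hash position."""
    slot = key % TABLE_SIZE
    for _ in range(TABLE_SIZE):
        if table[slot] is None:
            table[slot] = key
            return slot
        slot = (slot + 1) % TABLE_SIZE
    raise RuntimeError("hash table is full")


def probes_to_find(table, key):
    """Return how many slots are inspected before the key is found."""
    slot = key % TABLE_SIZE
    for probes in range(1, TABLE_SIZE + 1):
        if table[slot] is None:
            return None          # reached an empty slot: key is absent
        if table[slot] == key:
            return probes
        slot = (slot + 1) % TABLE_SIZE
    return None


table = [None] * TABLE_SIZE
for key in [100, 13, 101, 14, 103, 109]:
    insert(table, key)

for key in [100, 13, 14]:
    print(key, probes_to_find(table, key))
```

Under these assumptions 100 is found on the first probe, 13 needs two probes and 14 needs three, because earlier collisions pushed keys into the slots that later keys wanted.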

4. Why can't a B-tree be used?

B-Tree means "B tree". The hyphen is not a minus sign, so it is not read as "B minus tree".

Let's continue to simulate:
https://www.cs.usfca.edu/~galles/visualization/BTree.html

After inserting the 10 numbers 1-10:

image.png

The B-tree does solve the range query problem of the hash table mentioned above.

Suppose we execute the following SQL:

select * from t where id > 5;

(1) First find 5; the search path is 4–>6–>5;
(2) then return to the previous layer to find 6;
(3) then go back down to find 7;
(4) ...

As you can see, the scan keeps going back and forth between layers. The more data there is, the more of these round trips are needed and the more time is wasted.
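The back-and-forth can be made visible with a small sketch. The tree below is built by hand and its shape is an assumption, chosen to be consistent with the search path 4–>6–>5 above; `BTreeNode` and `keys_greater_than` are illustrative names.

```python
class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []   # empty list means this is a leaf


def keys_greater_than(node, low, depth=0):
    """In-order walk that yields (key, depth) for every key greater than low."""
    for i, key in enumerate(node.keys):
        # Child i only holds keys smaller than `key`, so skip it if key <= low.
        if node.children and key > low:
            yield from keys_greater_than(node.children[i], low, depth + 1)
        if key > low:
            yield key, depth
    if node.children:
        yield from keys_greater_than(node.children[-1], low, depth + 1)


# Hand-built B-tree holding the keys 1-10 (assumed shape).
root = BTreeNode([4], [
    BTreeNode([2], [BTreeNode([1]), BTreeNode([3])]),
    BTreeNode([6, 8], [BTreeNode([5]), BTreeNode([7]), BTreeNode([9, 10])]),
])

result = list(keys_greater_than(root, 5))
print(result)
```

The depth in each pair alternates between 1 (a non-leaf node) and 2 (a leaf): 6 comes from the parent, 7 from a leaf, 8 from the parent again, and so on. Matching keys are scattered across layers, so the scan cannot simply read them off one after another.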

5. Why can a B+ tree be used?

We use this to simulate:
https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html

Construct a B+ tree from the numbers 1-10:

image.png

Let me first introduce the structure of this tree.

It is divided into two parts, leaf nodes and non-leaf nodes.

  • The leaf nodes are chained together as a linked list, so all inserted data is linked in order;
  • Non-leaf nodes only store keys;
  • Leaf nodes store both the key and the value.

key: the indexed number (1-10 in this example);

value: the address of the data that the key refers to.

image.png

This solves the back-and-forth search problem of the B-tree, and search efficiency improves overall.

For example:

select * from t where id > 5;

Look at the picture below:
Kapture 2021-03-20 at 12.15.28.gif

As you can see, the search first passes through the non-leaf nodes 5, then 7, then 6, and finally reaches the leaf node 5.

image.png

Once the leaf is found, the remaining results can be read in order by following the linked list of leaves, with no need to go back to the previous layer.
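The same range query can be sketched for a B+ tree. The tree here is a simplified, hand-built two-level example rather than the exact tree from the visualization, and the class names, the `"addr-N"` values and the fan-out are all assumptions for illustration.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys, values):
        self.keys = keys
        self.values = values   # stand-ins for row addresses
        self.next = None       # link to the next leaf


class Internal:
    def __init__(self, keys, children):
        self.keys = keys       # non-leaf nodes store keys only
        self.children = children


def find_leaf(node, key):
    """Descend from the root to the leaf that would contain the key."""
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
    return node


def range_greater_than(root, low):
    """Descend once, then follow the leaf links; never go back up."""
    leaf = find_leaf(root, low)
    result = []
    while leaf is not None:
        for key, value in zip(leaf.keys, leaf.values):
            if key > low:
                result.append((key, value))
        leaf = leaf.next
    return result


# Leaves hold the keys 1-10 in pairs and are linked left to right.
leaves = [Leaf([k, k + 1], [f"addr-{k}", f"addr-{k + 1}"]) for k in range(1, 10, 2)]
for left, right in zip(leaves, leaves[1:]):
    left.next = right

root = Internal([3, 5, 7, 9], leaves)

rows = range_greater_than(root, 5)
print([key for key, _ in rows])
```

The tree is traversed from the root only once, to locate the leaf that holds 5. Everything after that is a sequential walk along the `next` pointers, which is why range scans suit this structure.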

Origin blog.csdn.net/qq_17623363/article/details/115029329