Red-black tree, B-tree, B+ tree, MySQL index interview questions

Red-black tree

1 Red-black tree characteristics

  1. Each node is either black or red.
  2. The root node is black.
  3. Each leaf node (NIL) is black. [Note: the leaf nodes here are the empty (NIL or NULL) leaf nodes.]
  4. If a node is red, its children must be black.
  5. Every path from a node to its descendant leaf nodes contains the same number of black nodes. [The paths meant here are those that end at a NIL leaf node.] As a consequence, the height of a red-black tree containing n internal nodes is O(log n).
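
The five properties can be checked mechanically. The following is a minimal sketch, not a full red-black tree: the `Node` class and the function names are made up for illustration, `None` stands for the black NIL leaf, and the binary-search-tree ordering of the keys is assumed rather than checked.

```python
RED, BLACK = "red", "black"

class Node:
    """Hypothetical node type for illustration; None plays the role of NIL."""
    def __init__(self, key, color, left=None, right=None):
        self.key, self.color = key, color
        self.left, self.right = left, right

def black_height(node):
    """Return the number of black nodes on any path from `node` down to NIL.

    Raises ValueError as soon as property 1, 4 or 5 is violated.
    """
    if node is None:
        return 1                      # property 3: NIL leaves count as black
    if node.color not in (RED, BLACK):
        raise ValueError("property 1: node is neither red nor black")
    if node.color == RED:
        for child in (node.left, node.right):
            if child is not None and child.color == RED:
                raise ValueError("property 4: red node with a red child")
    left = black_height(node.left)
    right = black_height(node.right)
    if left != right:
        raise ValueError("property 5: paths differ in black-node count")
    return left + (1 if node.color == BLACK else 0)

def is_red_black(root):
    """Check properties 1-5 (key ordering is not checked)."""
    if root is not None and root.color != BLACK:
        return False                  # property 2: the root must be black
    try:
        black_height(root)
        return True
    except ValueError:
        return False

valid = Node(10, BLACK, Node(5, RED), Node(15, RED))
red_red = Node(10, BLACK, Node(5, RED, Node(1, RED)), None)
lopsided = Node(10, BLACK, Node(5, BLACK), None)

print(is_red_black(valid), is_red_black(red_red), is_red_black(lopsided))
```

The first tree passes; the second breaks property 4 and the third breaks property 5.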

2 Usage scenarios of red-black trees

In Java, red-black trees are used by TreeSet (which is backed by TreeMap) and by HashMap since JDK 1.8, which converts long collision chains in a bucket into red-black trees.
But why use a red-black tree at all? Every insertion and deletion has to preserve the five properties above, which requires fairly complex operations.
Reason:
The red-black tree is a balanced tree, and its complicated definitions and rules exist to keep the tree balanced. A search tree that does not maintain its balance can degenerate: if the keys arrive in sorted order, for example, every node has only one child and the tree effectively becomes a linked list.

The main purpose of balancing is to keep the height of the tree low, because the search performance of a tree depends on its height. The lower the tree, the more efficient the search.
This is the motivation behind the many kinds of trees: binary trees, binary search trees, balanced binary search trees, and so on.
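
To make the degenerate case concrete, here is a small sketch that inserts sorted keys into a plain, unbalanced binary search tree and measures its height. The `Node`, `bst_insert`, `build` and `height` names are illustrative only; the comparison value is the lowest height any binary tree with the same number of nodes could have.

```python
import math

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def bst_insert(root, key):
    """Insert into a plain BST with no rebalancing (iterative)."""
    if root is None:
        return Node(key)
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                cur.left = Node(key)
                break
            cur = cur.left
        else:
            if cur.right is None:
                cur.right = Node(key)
                break
            cur = cur.right
    return root

def build(keys):
    root = None
    for k in keys:
        root = bst_insert(root, k)
    return root

def height(root):
    """Number of levels, computed level by level to avoid deep recursion."""
    levels = 0
    frontier = [root] if root is not None else []
    while frontier:
        levels += 1
        frontier = [c for n in frontier for c in (n.left, n.right) if c is not None]
    return levels

n = 1000
degenerate = height(build(range(1, n + 1)))   # sorted input: one long chain
best_case = math.ceil(math.log2(n + 1))       # lowest possible height for n nodes

print(degenerate, best_case)  # 1000 10
```

With sorted input every lookup may walk 1000 nodes, while a balanced tree of the same size needs on the order of 10 levels.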

B-tree

Overview

The B in B-tree is commonly read as "balanced": a B-tree is a multi-way self-balancing search tree. It is similar to an ordinary balanced binary tree, with one difference: a B-tree allows each node to have more than two child nodes. The figure below is a simplified diagram of a B-tree.
[Figure: simplified diagram of a B-tree]

1 Characteristics of B-tree

Wikipedia describes the B-tree as follows: in computer science, a B-tree is a self-balancing tree data structure that keeps data sorted and supports searches, sequential access, insertions, and deletions in O(log n) time. In short, a B-tree generalizes the binary search tree by allowing a node to have more than two child nodes. Unlike self-balancing binary search trees, the B-tree is optimized for systems that read and write large blocks of data. It reduces the number of intermediate steps needed to locate a record, which speeds up access, and it is commonly used in databases and file systems.

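A B-tree of order m satisfies the following conditions:

  1. Each node has at most m subtrees.
  2. Every branch node other than the root has at least ⌈m/2⌉ subtrees (m/2 rounded up).
  3. The root node has at least two subtrees (unless the B-tree consists of a single node).
  4. All leaf nodes are on the same level. The leaf nodes of a B-tree can be regarded as external nodes that carry no information.
  5. A non-leaf node with j children has exactly j-1 keys, arranged in increasing order.

A B-tree is also called a balanced multi-way search tree.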

The picture below is a B-tree of order M=4.
[Figure: a B-tree of order M=4]
You can see that the B-tree is an extension of the 2-3 tree. It allows a node to have more than 2 elements.

The insertion and balancing operations of a B-tree are very similar to those of a 2-3 tree, so they are not described in detail here. The following keys are inserted into the B-tree one by one:

6 10 4 14 5 11 15 3 2 12 1 7 8 8 6 3 6 21 5 15 15 6 32 23 45 65 7 8 6 5 4

See this link for the animation.
Original link: https://www.yycoding.xyz/post/2014/3/29/introduce-b-tree-and-b-plus-tree
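
For readers who want to follow the insertions without the animation, here is a minimal sketch of B-tree insertion for order m=4. It is one common way to do it (insert into a leaf, then split overflowing nodes on the way back up) and is not taken from the linked article. The `Node`, `insert` and `inorder` names are illustrative, and silently ignoring duplicate keys is an assumption made here, since the key sequence contains repeats.

```python
from bisect import bisect_left

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = list(keys or [])
        self.children = list(children or [])   # empty list for a leaf

def _insert(node, key, m):
    """Insert key below node; return (median, right_node) if node had to split."""
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return None                             # assumption: duplicates are ignored
    if not node.children:                       # leaf: place the key in order
        node.keys.insert(i, key)
    else:                                       # internal: descend into interval i
        split = _insert(node.children[i], key, m)
        if split is not None:
            median, right = split
            node.keys.insert(i, median)
            node.children.insert(i + 1, right)
    if len(node.keys) > m - 1:                  # more than m-1 keys: split in two
        h = len(node.keys) // 2
        median = node.keys[h]
        right = Node(node.keys[h + 1:], node.children[h + 1:])
        node.keys = node.keys[:h]
        node.children = node.children[:h + 1]
        return median, right
    return None

def insert(root, key, m):
    """Insert key and return the (possibly new) root."""
    split = _insert(root, key, m)
    if split is not None:                       # the root split: the tree grows by one level
        median, right = split
        return Node([median], [root, right])
    return root

def inorder(node):
    """Keys in ascending order."""
    if not node.children:
        return list(node.keys)
    out = []
    for i, child in enumerate(node.children):
        out.extend(inorder(child))
        if i < len(node.keys):
            out.append(node.keys[i])
    return out

seq = [6, 10, 4, 14, 5, 11, 15, 3, 2, 12, 1, 7, 8, 8, 6, 3, 6, 21, 5, 15,
       15, 6, 32, 23, 45, 65, 7, 8, 6, 5, 4]
root = Node()
for k in seq:
    root = insert(root, k, m=4)

print(inorder(root))
```

The tree only grows in height when the root itself splits, which is why all leaves stay on the same level.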

2. Usage scenarios of B-tree

B-trees are mostly used for file system indexes.
So here comes the question: why use a B-tree? Isn't a red-black tree good enough?
Reason:
Compared with binary trees such as red-black trees, a B-tree node has more subtrees, which means more branches per node. More branches mean a lower tree and therefore a more efficient search. The number of branches cannot grow without limit, however: with too many branches the tree degenerates into a single ordered array (as shown below).
[Figure: a B-tree with too many branches degenerating into an ordered array]

Why do data structures such as B-trees appear?

Many balanced binary trees have traditionally been used for searching, such as AVL trees and red-black trees. They give very good query performance under normal circumstances, but they fall short when the data set is very large. When the data no longer fits in memory, most of it has to live on disk, and only the parts that are needed are loaded into memory. Generally speaking, a memory access takes about 50 ns while a disk access takes about 10 ms. That is a ratio of about 200,000, roughly 5 orders of magnitude, so the time spent reading from disk dwarfs the time spent comparing keys in memory, and the program is blocked on disk IO most of the time. How do we improve performance, then? By reducing the number of disk IOs. Balanced binary trees such as AVL trees and red-black trees are not designed to "cater" to disks.

A memory access, an SSD access, and a random access on a SATA hard disk take roughly tens of nanoseconds, tens of microseconds, and tens of milliseconds respectively.

[Figure: a simple balanced binary tree]
The picture above is a simple balanced binary tree. A balanced binary tree keeps its balance through rotations, and a rotation operates on the tree as a whole, so if only part of the tree is loaded into memory, the rotation cannot be completed. Secondly, the height of a balanced binary tree is relatively large, about log n with base 2, so nodes that are logically close may be physically far apart, and disk read-ahead (the principle of locality) cannot be put to good use. For these reasons, balanced binary trees of this kind were passed over for databases and file systems.

The principle of spatial locality: If a certain location in memory is accessed, then locations nearby will also be accessed.

Let's look at the design of the B-tree from the perspective of "catering" the disk.

The efficiency of an index depends on the number of disk IOs, so a fast index has to reduce that number. How? Indexing works by repeatedly narrowing the search range, just as we look up a word in a dictionary: first find the first letter to narrow the range, then the second letter, and so on. A balanced binary tree divides the range into two intervals at each step. To go faster, a B-tree divides the range into many intervals at each step; the more intervals, the faster the data is located. Since each node now describes a set of intervals, each node is larger. Therefore, when a new node is created, a page-sized space is requested directly. (Disk storage is divided into blocks, usually 512 bytes. A disk IO reads several blocks at a time, which we call a page; the exact size depends on the operating system and is usually 4 KB, 8 KB or 16 KB.) Memory allocation is page-aligned, so reading a node takes only one IO.
[Figure: a simplified B-tree]
The picture above is a simplified B-tree. The benefit of multiple branches is obvious: it reduces the height of the B-tree to log n with a large base, where the base is related to the number of child nodes per node. Generally, the height of a B-tree is about 3 levels. The fewer the levels, the more precisely each node pins down the range and the faster the range shrinks (far faster than a deep search in a binary tree). As mentioned above, reading a node costs one IO, so the total number of IOs drops to log n with that large base. Each node of the B-tree holds n keys in order (a1, a2, a3, ..., an), and its child nodes divide the key space into n+1 intervals for indexing (X1 < a1, a1 < X2 < a2, ..., a(n-1) < Xn < an, X(n+1) > an).

Comment: each node of a B-tree stores multiple values, unlike a binary tree, where a node stores one value. A B-tree node therefore splits its range into several small intervals, and the more intervals there are, the faster the search. For example, given the numbers 1-100, a binary tree can only split them into two ranges at a time, 1-50 and 51-100, while a B-tree node with three keys splits them into four ranges, 1-25, 26-50, 51-75 and 76-100, and so filters out three quarters of the data in one step. This is why the B-tree, as a multi-way tree, is faster.
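
The interval idea can be sketched as a lookup routine. This is an illustrative sketch only: the `Node` class, the `search` function and the hand-built two-level tree are assumptions, with the root keys 25, 50 and 75 chosen to mirror the four ranges in the comment above. Each visited node stands for one page read.

```python
from bisect import bisect_left

class Node:
    def __init__(self, keys, children=()):
        self.keys = list(keys)          # n keys in ascending order
        self.children = list(children)  # n+1 children, or none for a leaf

def search(root, key):
    """Return (found, nodes_visited); one visited node models one disk IO."""
    node, visited = root, 0
    while node is not None:
        visited += 1
        i = bisect_left(node.keys, key)        # pick one of the n+1 intervals
        if i < len(node.keys) and node.keys[i] == key:
            return True, visited
        node = node.children[i] if node.children else None
    return False, visited

# Hand-built example: the root splits 1-100 into four ranges.
root = Node([25, 50, 75], [
    Node([5, 10, 20]),
    Node([30, 40]),
    Node([55, 60, 70]),
    Node([80, 90]),
])

print(search(root, 60))  # (True, 2)
print(search(root, 26))  # (False, 2)
print(search(root, 50))  # (True, 1)
```

A single comparison pass over the root's three keys discards three of the four subtrees, which is the "three quarters in one step" effect.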

B+ tree

There are many variants of B-Tree, the most common of which is B+Tree. For example, MySQL commonly uses B+Tree to implement its index structure.

Compared with B-Tree, B+Tree has the following differences:

The pointer limit for each node is 2d instead of 2d+1.

Internal nodes do not store data, only keys; leaf nodes do not store pointers to child nodes.

1. Characteristics of B+ tree

1 An internal node with k subtrees contains k elements (k-1 elements in a B-tree). These elements do not carry data and are used only for indexing. All data is stored in the leaf nodes.

2 The leaf nodes together contain all the elements, along with pointers to the records for those elements, and the leaf nodes themselves are linked in ascending key order.

3 Every element of an internal node also appears in one of its child nodes, where it is the largest (or smallest) element of that child.

The difference between B+ tree and B tree

The B+ tree is a variant of the B-tree and is also a multi-way search tree. It differs from the B-tree as follows:

  1. All keys are stored in the leaf nodes; internal (non-leaf) nodes do not store the real data.
  2. A chain pointer is added to every leaf node, linking it to the next leaf.

Why does the database use B+ trees instead of B trees and red-black trees?

1. First, let’s talk about why red-black trees don’t work:

A red-black tree is normally kept in memory, and a database table is too large to fit there.
Even if you find a way to store the red-black tree on disk, finding a node requires visiting up to about log N levels (base 2), and each level costs one page read (even though you only want one node, the disk has to read a whole page at a time). That is about log N IOs in total, which is far too expensive.
Therefore we must reduce the number of levels in the tree to reduce the number of IOs and speed up database queries and updates. Both the B-tree and the B+ tree have this property. Each of their nodes has many children (tens to thousands), so the height of the whole tree can be kept very low.

For example, with 100,000,000 records and 1000 children per node, log_1000(100,000,000) ≈ 2.67 < 3, so 3 levels are enough to store them all.
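
Written out, the estimate above and the corresponding figure for a binary tree (fan-out 2, a comparison added here for illustration) are:

```latex
\[
h_{1000} = \left\lceil \log_{1000} 10^{8} \right\rceil
         = \left\lceil \tfrac{8}{3} \right\rceil = 3,
\qquad
h_{2} = \left\lceil \log_{2} 10^{8} \right\rceil
      = \left\lceil 8 \log_{2} 10 \right\rceil
      = \left\lceil 26.58 \right\rceil = 27 .
\]
```

If each level costs one disk access of about 10 ms, as assumed earlier, that is roughly 30 ms against 270 ms per lookup.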

2. Next, the difference between the B-tree and the B+ tree:

In a B-tree every node is a data node, whereas in a B+ tree only the leaf nodes are data nodes. Non-leaf (internal) nodes only guide the search and do not store actual data.
All data nodes of a B+ tree are on the lowest level (the leaf level), and adjacent leaf nodes are connected in a linked list.
Note: reading one byte, 10 bytes, or a whole page from disk takes almost the same time. Most of the cost of a disk read is positioning the head over the right place; once it is there, transferring a few more consecutive bytes adds very little.

[Figure: simplified disk structure diagram]

The disk is divided into a series of concentric rings, with the center of the circle being the center of the disk. Each concentric ring is called a track, and all tracks with the same radius form a cylinder. The track is divided into small segments along the radius line. Each segment is called a sector, and each sector is the smallest storage unit of the disk. For the sake of simplicity, we assume below that the disk has only one platter and one head.

When data needs to be read from the disk, the system will pass the data logical address to the disk, and the disk's control circuit will translate the logical address into a physical address according to the addressing logic, that is, determine which track and sector the data to be read is on. In order to read the data in this sector, the magnetic head needs to be placed over the sector. In order to achieve this, the magnetic head needs to move to align with the corresponding track. This process is called seeking.

The principle of locality and disk read-ahead

Because of the characteristics of the storage medium, disk access is much slower than main memory access. Add the cost of mechanical movement, and a disk access is often several orders of magnitude slower than a memory access. To improve efficiency, the number of disk I/Os must therefore be kept to a minimum. To this end, the disk usually does not read strictly on demand but reads ahead every time: even if only one byte is needed, the disk starts from that position and sequentially reads a certain length of data after it into memory. The theoretical basis for this is the well-known principle of locality in computer science:

When one piece of data is used, nearby data is usually used immediately.

The data required during program execution is usually concentrated.

Because sequential disk reads are very efficient (no seek time and very little rotational delay), read-ahead improves I/O efficiency for programs that exhibit locality.

The read-ahead length is generally an integral multiple of the page size. A page is the logical block in which a computer manages memory: hardware and operating systems usually divide main memory and disk storage into consecutive blocks of equal size, each called a page (in many operating systems the page size is 4 KB), and main memory and disk exchange data in units of pages. When the data a program wants to read is not in main memory, a page fault is triggered. The system then sends a read request to the disk, which finds the starting position of the data, reads one or more consecutive pages from there and loads them into memory. The page fault handler then returns and the program continues to run.
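
Page granularity comes down to simple arithmetic. The helper below is a hypothetical illustration that assumes 4 KB pages; it lists which pages a byte range touches and so shows why a page-aligned, page-sized tree node costs exactly one page read, while a range that straddles a boundary costs two.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

def pages_touched(offset, length, page_size=PAGE_SIZE):
    """Return the page numbers covering bytes [offset, offset + length)."""
    if length <= 0:
        return []
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return list(range(first, last + 1))

print(pages_touched(5000, 1))      # [1]     one byte still costs a whole page
print(pages_touched(8192, 4096))   # [2]     aligned, page-sized node: one page
print(pages_touched(4090, 10))     # [0, 1]  straddles a page boundary
```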

The detailed principles of disks can be found here: Data structure and algorithm principles behind MySQL indexes

3. Let’s talk about why b-tree is not as good as b+ tree:

1 The internal nodes of a B-tree store actual data. Suppose a node is one 4096-byte page and each data item takes 128 bytes. Then a node can hold only 4096/128 = 32 data items and has at most 33 child nodes, which is clearly not enough. The internal nodes of a B+ tree serve only as guides: if each entry is just a 4-byte integer key, a node holds 4096/4 = 1024 entries. Each node of the B+ tree therefore has far more children, the whole tree is lower, and queries are much more efficient.
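
The effect of these fan-outs on tree height can be worked through in a few lines. This is a rough sketch using the page, item and key sizes from the paragraph above; the record count of 100,000,000 is borrowed from the earlier example, pointer space inside a node is ignored as in the text, and `levels_needed` is an illustrative helper.

```python
PAGE_SIZE = 4096   # bytes per node
ITEM_SIZE = 128    # bytes per data item stored in a B-tree node
KEY_SIZE = 4       # bytes per integer key in a B+ tree internal node

def levels_needed(n_records, fanout):
    """Smallest number of levels h such that fanout**h >= n_records."""
    levels, reachable = 0, 1
    while reachable < n_records:
        reachable *= fanout
        levels += 1
    return levels

btree_fanout = PAGE_SIZE // ITEM_SIZE + 1   # 32 items per node -> 33 children
bplus_fanout = PAGE_SIZE // KEY_SIZE        # 1024 keys per internal node

n = 100_000_000
print(btree_fanout, levels_needed(n, btree_fanout))   # 33 6
print(bplus_fanout, levels_needed(n, bplus_fanout))   # 1024 3
```

Under these assumptions the B+ tree reaches any record in about half as many page reads as the B-tree.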

2 The leaf nodes of a B+ tree are connected in a linked list, which suits range queries: once the first matching leaf is found, the adjacent pages can be read one after another. A B-tree cannot do this, because its data is spread over all levels of the tree.
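
A range query over linked leaves can be sketched as follows. Everything here is illustrative: the `Leaf` class, the single list of separator keys standing in for the internal levels, and the function names are assumptions, not a real B+ tree implementation.

```python
from bisect import bisect_left, bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = list(keys)   # sorted keys (the row data would live here too)
        self.next = None         # chain pointer to the next leaf

def link(leaves):
    """Connect the leaves into a singly linked list, left to right."""
    for a, b in zip(leaves, leaves[1:]):
        a.next = b

def find_leaf(separators, leaves, key):
    """Stand-in for the descent through internal nodes, which hold keys only."""
    return leaves[bisect_right(separators, key)]

def range_scan(start_leaf, lo, hi):
    """Collect all keys in [lo, hi] by walking the leaf chain."""
    out = []
    leaf = start_leaf
    i = bisect_left(leaf.keys, lo)
    while leaf is not None:
        for k in leaf.keys[i:]:
            if k > hi:
                return out
            out.append(k)
        leaf, i = leaf.next, 0   # follow the chain pointer; no return to the root
    return out

leaves = [Leaf([1, 3, 5]), Leaf([7, 9, 11]), Leaf([13, 15, 17])]
link(leaves)
separators = [7, 13]             # smallest key of each leaf after the first

start = find_leaf(separators, leaves, 4)
print(range_scan(start, 4, 14))  # [5, 7, 9, 11, 13]
```

The tree is descended once to find the starting leaf; after that the scan only follows `next` pointers.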

Performance analysis of B-/B+ index

[Figure: performance analysis of B-/B+ tree indexes]


Origin blog.csdn.net/qq_41398619/article/details/126630871