B-tree concept

Introduction

    The B-tree is a data structure designed for auxiliary storage, proposed by R. Bayer and E. McCreight in 1970. It is widely used in file systems and databases to reduce the number of IO operations. The authors never explained why it is called a B-tree, but given its properties the B is usually read as "Balanced". Chinese texts often write the name as "B-树" and read it as "B-minus tree"; no such separate structure exists, and the name is just a literal rendering of the hyphen in the English "B-Tree".

    A typical B-tree is shown in Figure 1.

    

    Figure 1. A typical B-tree

 

    A tree that meets the following characteristics can be called a B-tree:

  •     If the root node is not a leaf node, it has at least two subtrees
  •     A node holding N elements has N+1 pointers. The number of elements in each node (other than the root) must not be less than 1/2 of the maximum node capacity
  •     All leaves are at the same level (that's why it's called a balanced tree)
  •     Every element in the subtree under the left pointer of a node element is smaller than that element, and every element in the subtree under its right pointer is larger. For example, in Figure 1 everything under the left pointer of Q is smaller than Q, and everything under its right pointer is larger than Q
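
    The four properties above can be checked mechanically. The following is a minimal sketch, not a reference implementation: the Node class, the is_btree function and the max_keys parameter are names invented for this illustration, and the sample tree is only assumed to resemble Figure 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    keys: List[str]                                   # sorted elements in this node
    children: List["Node"] = field(default_factory=list)  # empty for a leaf


def is_btree(root: Node, max_keys: int) -> bool:
    """Check the four B-tree properties for nodes holding at most max_keys keys."""
    min_keys = max(1, max_keys // 2)   # "at least half full" for non-root nodes
    leaf_depths = set()

    def visit(node: Node, depth: int, lo: Optional[str], hi: Optional[str]) -> bool:
        n = len(node.keys)
        # Capacity: the root needs at least 1 key, other nodes at least min_keys.
        if n > max_keys or n < (1 if node is root else min_keys):
            return False
        # Keys must be strictly increasing and lie between the parent's bounds.
        if node.keys != sorted(set(node.keys)):
            return False
        if lo is not None and node.keys[0] <= lo:
            return False
        if hi is not None and node.keys[-1] >= hi:
            return False
        if not node.children:
            leaf_depths.add(depth)
            return True
        # N elements require N + 1 pointers.
        if len(node.children) != n + 1:
            return False
        bounds = [lo, *node.keys, hi]
        return all(
            visit(child, depth + 1, bounds[i], bounds[i + 1])
            for i, child in enumerate(node.children)
        )

    # All leaves must sit at the same depth.
    return visit(root, 0, None, None) and len(leaf_depths) == 1


def leaf(letters: str) -> Node:
    return Node(list(letters))


# Hypothetical tree in the spirit of Figure 1.
root = Node(["M"], [
    Node(["D", "H"], [leaf("BC"), leaf("FG"), leaf("JKL")]),
    Node(["Q", "T", "X"], [leaf("NP"), leaf("RS"), leaf("VW"), leaf("YZ")]),
])
# Same ordering rules, but the leaves are on different levels.
lopsided = Node(["M"], [leaf("D"), Node(["Q"], [leaf("N"), leaf("R")])])

print(is_btree(root, max_keys=3), is_btree(lopsided, max_keys=3))
```

    The second tree keeps its keys in order but fails the "all leaves at the same level" rule, which is the property that makes the structure balanced.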

 

Why use B-trees

    In a computer system, storage devices fall into two broad types. The first is main memory (CPU caches, RAM, and so on), which is generally built from silicon and is very fast, but costs far more per byte than secondary storage. The second is auxiliary storage (hard disks and other disk devices), which usually offers much larger capacity at a much lower cost but is very slow to access. Let's look at the most common auxiliary storage device, the hard disk.

    As the only mechanical storage device in the host, the hard disk is far slower than the CPU and memory. Figure 2 shows a typical disk drive.

    

    Figure 2. How a typical disk drive works

 

    A drive contains a number of platters that rotate around a spindle at a fixed speed (for example, 7200 RPM is common in PCs, while server-class drives run at 10000 RPM or 15000 RPM), and the surface of each platter is coated with a magnetizable material. Each platter is read and written by a magnetic head at the end of an arm. The arms are physically linked and move together toward or away from the spindle.

    Because it has mechanical moving parts, a disk is very slow compared with memory. The mechanical movement has two components: platter rotation and arm movement. Considering rotation alone, a common 7200 RPM hard disk takes 60/7200 s ≈ 8.33 ms per revolution. In other words, in the worst case the platter must turn a full circle, taking 8.33 ms, before the required data passes under the head. Compared with a typical memory access time of 100 ns, that is roughly 100,000 times slower, and it does not include the time needed to move the arm.
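
    The rotation arithmetic can be reproduced in a few lines. This is only a sketch: the 100 ns memory access time is the figure quoted in the text, and the function name full_rotation_ms is made up for the example.

```python
MEMORY_ACCESS_NS = 100  # memory access time quoted in the text


def full_rotation_ms(rpm: int) -> float:
    """Time for one full platter revolution, in milliseconds."""
    return 60_000 / rpm


for rpm in (7200, 10000, 15000):
    ms = full_rotation_ms(rpm)
    ratio = ms * 1_000_000 / MEMORY_ACCESS_NS  # ms -> ns, then compare
    print(f"{rpm} RPM: {ms:.2f} ms per revolution, "
          f"about {ratio:,.0f} times a {MEMORY_ACCESS_NS} ns memory access")
```

    For 7200 RPM the ratio comes out near 83,000, the same order of magnitude as the "about 100,000 times" rule of thumb.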

    Because mechanical movement takes so much time, the disk reads many data items at once rather than one at a time. The smallest unit it transfers is generally a cluster; for SQL Server, the unit is one page (8 KB).

    However, the data to be searched is often too large to fit entirely in main memory, so a disk is needed as secondary storage. Reading from disk then dominates the processing time, so reducing the number of disk IO operations as far as possible greatly speeds up the search. This is the original motivation behind the B-tree design.

    The B-tree greatly reduces IO operations on auxiliary storage by keeping the root node in main memory and all other nodes on auxiliary storage. For example, to find element Y in Figure 1, I only need to fetch the root node from main memory, do one IO read to follow the root's right pointer, and then do a second IO read to follow that node's rightmost pointer, at which point element Y is found. Needing only two auxiliary storage IO reads greatly reduces the search time compared with other data structures.
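
    The lookup just described can be sketched as follows. The tree is built in memory and each pointer followed below the root is merely counted as one simulated IO read; the Node class, the search function and the tree contents are assumptions made for this illustration, since the figure itself is not reproduced here.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    keys: List[str]
    children: List["Node"] = field(default_factory=list)


def search(root: Node, key: str) -> Tuple[bool, int]:
    """Return (found, io_reads); the root is assumed to be in main memory."""
    node, io_reads = root, 0
    while True:
        i = bisect_left(node.keys, key)        # position of key within this node
        if i < len(node.keys) and node.keys[i] == key:
            return True, io_reads
        if not node.children:                  # reached a leaf without a match
            return False, io_reads
        node = node.children[i]                # follow pointer i ...
        io_reads += 1                          # ... which costs one simulated read


def leaf(letters: str) -> Node:
    return Node(list(letters))


# Hypothetical tree in the spirit of Figure 1.
root = Node(["M"], [
    Node(["D", "H"], [leaf("BC"), leaf("FG"), leaf("JKL")]),
    Node(["Q", "T", "X"], [leaf("NP"), leaf("RS"), leaf("VW"), leaf("YZ")]),
])

print(search(root, "Y"))  # (True, 2)
```

    The number of reads is bounded by the number of levels below the root, which is why the height of the tree matters so much.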

 

Height of a B-tree

    According to the above example, we can see that the number of IO reads for auxiliary storage depends on the height of the B-tree. And what determines the height of the B-tree?

     The height h of a B-tree satisfies:

     $$h \le \log_T \frac{N + 1}{2}$$

      where T is the minimum degree (every node other than the root has at least T children, and therefore at least T-1 elements) and N is the total number of elements.

     We can see that T has a decisive influence on the height of the tree. The more elements each node holds, the lower the tree for the same total number of elements. This is why SQL Server clustered indexes should be built on keys that are as narrow as possible: each node in SQL Server is an 8 KB page (8192 bytes), so a smaller key lets each node hold more elements, which reduces the height of the B-tree and improves query performance.
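
    A rough estimate shows how key width changes the number of levels. This sketch rests on simplifying assumptions that are not SQL Server's real page layout: the whole 8192-byte page is treated as usable, each entry is charged a hypothetical 8 bytes of pointer overhead, and every page is fully packed. The names fanout and levels_needed are invented for the example.

```python
PAGE_BYTES = 8192      # one 8 KB page
ENTRY_OVERHEAD = 8     # hypothetical per-entry pointer cost, not a real SQL Server figure


def fanout(key_bytes: int) -> int:
    """How many entries fit in one page under the assumptions above."""
    return PAGE_BYTES // (key_bytes + ENTRY_OVERHEAD)


def levels_needed(n_keys: int, key_bytes: int) -> int:
    """Smallest number of levels whose fully packed pages can index n_keys keys."""
    f = fanout(key_bytes)
    levels, capacity = 1, f
    while capacity < n_keys:
        capacity *= f
        levels += 1
    return levels


for key_bytes in (4, 16, 900):
    print(key_bytes, fanout(key_bytes), levels_needed(100_000_000, key_bytes))
```

    Under these assumptions, 100 million keys need 3 levels with a 4-byte key, 4 levels with a 16-byte key, and 9 levels with a 900-byte key.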

    The height formula above can be derived by adding up the minimum number of elements at each level. For a tree of minimum degree T and height h, the first level is the root (1 node), the second level has at least 2 nodes, the third at least 2T nodes, and the fourth at least 2T^2 nodes. The root holds at least 1 element and every other node at least T-1, so summing the minimum over all levels gives a lower bound on the total number of elements N:

    $$N \ge 1 + (T - 1)\sum_{i=1}^{h} 2T^{i-1} = 2T^h - 1$$

    Taking the logarithm of both sides, you can get the formula for the height of the tree.
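
    Written out step by step, and assuming the minimum element count of a tree with minimum degree T and height h is 2T^h - 1 (with h counting the levels below the root), the logarithm step looks like this:

$$
\begin{aligned}
N &\ge 2T^{h} - 1 \\
T^{h} &\le \frac{N + 1}{2} \\
h &\le \log_T \frac{N + 1}{2}
\end{aligned}
$$

    As an illustrative case with numbers chosen only for this example, T = 1000 and N = 10^9 give h ≤ log_1000(5 × 10^8) ≈ 2.9, so the leaves are at most 2 levels below the root.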

    This is also why every internal node must have at least two children: according to the height formula, if T=1 the base of the logarithm is 1 and the height bound grows toward positive infinity.

 

Repost: http://www.cnblogs.com/CareySon/archive/2012/04/06/Imple-BTree-With-CSharp.html
