MySQL clustered and non-clustered indexes, and why MySQL indexes use a B+ tree

  • Clustered index: the row data and the index are stored together, so finding the index entry also finds the data.
  • Non-clustered index: the row data and the index are stored separately, and the leaf nodes of the index point to the corresponding data rows. MyISAM caches index blocks in memory through key_buffer. When data is accessed through the index, the index is searched in memory first and the matching row is then read from disk. This is why access is slow (extra disk I/O) when the index block is not found in the key buffer.
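
The key buffer behaviour described above can be sketched as a small cache in front of the disk. This is only an illustration: the class name KeyBuffer, the LRU eviction policy, and the read_index_block function are assumptions made for the example, not MyISAM's actual implementation.

```python
from collections import OrderedDict


class KeyBuffer:
    """Toy LRU cache of index blocks (a stand-in for MyISAM's key_buffer)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # block id -> block contents
        self.disk_reads = 0           # how many times we had to go to "disk"

    def get_block(self, block_id, read_from_disk):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)   # hit: mark as most recently used
            return self.blocks[block_id]
        self.disk_reads += 1                    # miss: pay for a disk read
        block = read_from_disk(block_id)
        self.blocks[block_id] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict the least recently used block
        return block


def read_index_block(block_id):
    # Hypothetical disk read; here it only fabricates a block label.
    return f"index-block-{block_id}"


buf = KeyBuffer(capacity=2)
for block_id in [1, 2, 1, 3, 2]:
    buf.get_block(block_id, read_index_block)
```

Every access that misses the buffer increments disk_reads, which is the slow path the bullet above refers to.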

 

To clarify one concept: in InnoDB, any index created in addition to the clustered index is called an auxiliary (secondary) index, and reading row data through an auxiliary index always requires a second lookup. All non-clustered indexes, such as composite, prefix, and unique indexes, are auxiliary indexes. The leaf nodes of an auxiliary index do not store the physical location of the row; they store the primary key value.

When to use clustered vs nonclustered indexes

(Figure: clustered index)

 

InnoDB Clustered Index

  1. InnoDB uses a clustered index: the table is organized as a B+ tree on the primary key, and the row data is stored in the leaf nodes. For a primary key condition such as "where id = 14", a B+ tree search finds the corresponding leaf node and the row data is read from it.
  2. A conditional search on the Name column takes two steps. First, search the auxiliary index B+ tree for the Name value and read the corresponding primary key from its leaf node. Second, use that primary key to search the primary index B+ tree and reach the leaf node that holds the entire row. (The key point is that an auxiliary index is built on a key other than the primary key.)
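
The two lookup paths can be sketched as follows. This is a simplified model, not InnoDB code: a sorted list searched with binary search stands in for each B+ tree, and the SortedIndex class and the sample rows are made up for the illustration.

```python
import bisect


class SortedIndex:
    """Sorted key -> value map with binary search (stands in for a B+ tree)."""

    def __init__(self, items):
        pairs = sorted(items)
        self.keys = [k for k, _ in pairs]
        self.values = [v for _, v in pairs]
        self.searches = 0             # number of tree searches performed

    def find(self, key):
        self.searches += 1
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None


# Hypothetical rows: (id, name, age)
rows = [(14, "Alice", 30), (15, "Bob", 25), (18, "Carol", 41)]

clustered = SortedIndex((r[0], r) for r in rows)        # leaf holds the whole row
name_index = SortedIndex((r[1], r[0]) for r in rows)    # leaf holds the primary key


def find_by_id(pk):
    """One search: the clustered index leaf already contains the row."""
    return clustered.find(pk)


def find_by_name(name):
    """Two searches: auxiliary index -> primary key -> clustered index."""
    pk = name_index.find(name)
    if pk is None:
        return None
    return clustered.find(pk)


row = find_by_name("Bob")
```

After the call, name_index.searches and clustered.searches are both 1, which is the "two B+ tree searches" cost of going through the auxiliary index.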

Advantages of Clustered Indexes

At first glance the clustered index looks less efficient than a non-clustered one, because every lookup through an auxiliary index needs two B+ tree searches. Isn't that wasted work? What are the advantages of a clustered index?

  1. Because row data is stored in the leaf nodes, one page holds several rows. When different rows of the same data page are accessed, the page is already in the buffer and the disk does not have to be read again. The primary key and the row data are loaded into memory together, so the row can be returned as soon as the leaf node is found. If data is accessed by primary key ID, it is retrieved faster.
  2. Secondary indexes use the primary key value as the "pointer" instead of a row address. This makes secondary indexes take up more space, but it reduces their maintenance cost when a row moves or a data page splits: InnoDB does not need to update the "pointer" in the secondary index. The position of a row (located by its 16K page in the implementation) changes as data is modified (B+ tree node splits, that is, page splits), and with a clustered index the secondary index trees are unaffected no matter how the nodes of the primary key B+ tree change.
  3. A clustered index is well suited to sorting; a non-clustered index is not.
  4. A clustered index is also the better choice when fetching a range of data.
  5. The trade-off is that a secondary index lookup needs two index searches instead of one: the storage engine first finds the leaf node of the secondary index to get the primary key, then searches the clustered index with that primary key to find the row.
  6. Related data can be kept together. For example, in a mailbox application the data can be clustered by user ID, so that only a few data pages need to be read from disk to fetch all of one user's emails. Without a clustered index, each email could cost one disk I/O.
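
Point 2 can be illustrated with a small sketch that compares a secondary index storing primary keys with one storing row addresses. The dictionaries, slot numbers, and function names below are hypothetical and only model the idea; they do not mirror InnoDB's storage format.

```python
# Physical storage: slot number -> row (id, name).
slots = {0: (14, "Alice"), 1: (15, "Bob")}

clustered = {14: 0, 15: 1}               # primary key -> current slot
by_name_pk = {"Alice": 14, "Bob": 15}    # secondary index storing the primary key
by_name_addr = {"Alice": 0, "Bob": 1}    # secondary index storing a row address


def move_row(pk, new_slot):
    """Simulate a page split that relocates a row.

    Only the clustered index is updated; neither secondary index is touched.
    """
    old_slot = clustered[pk]
    slots[new_slot] = slots.pop(old_slot)
    clustered[pk] = new_slot


def lookup_via_pk(name):
    return slots[clustered[by_name_pk[name]]]


def lookup_via_addr(name):
    return slots.get(by_name_addr[name])   # None if the stored address is stale


move_row(15, 7)
```

After the move, lookup_via_pk("Bob") still finds the row, while lookup_via_addr("Bob") returns None until the address-based index is repaired. That repair is the maintenance work the primary key "pointer" avoids.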

Disadvantages of Clustered Indexes

  1. Maintaining the index is expensive, especially when inserting new rows or updating a primary key causes page splits. After inserting a large number of new rows, it is advisable to run OPTIMIZE TABLE during a period of low load, because the rows that had to be moved may leave the table fragmented. Using independent (file-per-table) tablespaces can reduce fragmentation.
  2. If the table uses a UUID (a random ID) as the primary key, rows are stored sparsely across pages, which can make scans of the clustered index, including full table scans, slow.

 

For this reason, an auto-increment integer (AUTO_INCREMENT) column is recommended as the primary key.


When primary key values are sequential, InnoDB stores each record right after the previous one. When the page reaches its maximum fill factor (InnoDB's default maximum fill factor is 15/16 of the page size, leaving some space for later modifications), the next record is written to a new page. Once data has been loaded in this sequential manner, the primary key pages are filled almost completely with sequential records (secondary index pages may look different).

One more disadvantage of the clustered index: if the primary key is large, every secondary index becomes larger too, because secondary index leaf nodes store the primary key value; an overly long primary key also makes non-leaf nodes take up more physical space.

Why an auto-increment id is usually recommended as the primary key

With a clustered index, the physical storage order of the data follows the index order: rows whose index entries are adjacent are also stored next to each other on disk. If the primary key is not an auto-increment id, inserts land at arbitrary positions, so InnoDB must keep moving rows and splitting pages. There are measures that reduce these operations, but they cannot be avoided completely. With an auto-increment id it is simple: pages are written one after another, the index structure stays compact, there is less fragmentation on disk, and efficiency is high.
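
The effect of key order on page splits can be seen in a rough simulation. It is a sketch under simplifying assumptions: pages hold a fixed number of keys (16 here, chosen arbitrarily), a full page is split in half unless the new key goes past the end of the last page, and the fill factor is ignored. The load function is made up for this example.

```python
import bisect
import random


def load(keys, capacity=16):
    """Insert keys one by one into fixed-capacity sorted pages.

    Returns (number_of_pages, number_of_page_splits).
    """
    pages = [[]]
    splits = 0
    for key in keys:
        # Locate the page whose key range should hold this key.
        idx = bisect.bisect_right([p[0] for p in pages[1:]], key)
        page = pages[idx]
        if len(page) == capacity:
            if idx == len(pages) - 1 and key > page[-1]:
                # Appending past the end: simply start a fresh page.
                pages.append([key])
                continue
            # Otherwise split the full page in half, which moves rows around.
            mid = capacity // 2
            pages[idx:idx + 1] = [page[:mid], page[mid:]]
            splits += 1
            page = pages[idx] if key < pages[idx + 1][0] else pages[idx + 1]
        bisect.insort(page, key)
    return len(pages), splits


# Sequential keys, as an auto-increment id would produce.
seq_pages, seq_splits = load(range(1, 1601))

# The same keys in random order, as a UUID-like key would produce.
shuffled = list(range(1, 1601))
random.Random(0).shuffle(shuffled)
rand_pages, rand_splits = load(shuffled)
```

With sequential keys the pages fill completely and no split occurs; with shuffled keys the same data needs more pages and many splits.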

The primary index of MyISAM is not a clustered index, so the physical addresses of its rows are scattered. After obtaining those addresses from the index, MyISAM has to read the rows with further I/O, which means repeated disk seeks and rotations. A clustered index finds the row in the same leaf page it has just read, with no additional I/O for the row itself (a strong contrast).

However, for operations such as sorting, full table scans, and counts over large amounts of data, MyISAM has the advantage: its indexes take up little space, and these operations need to be completed in memory.

How the clustered index is chosen in MySQL

By default the clustered index is the primary key. If the table defines no primary key, InnoDB chooses a unique index whose columns are all NOT NULL instead. If there is no such index either, InnoDB implicitly defines a hidden primary key and clusters on it. InnoDB only clusters records within the same page; pages containing adjacent key values may be far apart.
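
The selection rule above amounts to a three-step fallback. The sketch below expresses it as a plain function; the function name, the dictionary layout used to describe indexes, and the "<hidden row id>" label are assumptions for illustration, not a MySQL API.

```python
def choose_clustered_key(primary_key, indexes):
    """Pick the clustered key following the fallback order described above.

    primary_key: name of the primary key, or None if the table has none.
    indexes: list of dicts with 'name', 'unique' and 'nullable' entries,
             in definition order (a hypothetical description format).
    """
    if primary_key is not None:
        return primary_key
    for index in indexes:
        if index["unique"] and not index["nullable"]:
            return index["name"]          # first unique, non-null index
    return "<hidden row id>"              # implicit key generated by the engine


indexes = [
    {"name": "idx_name", "unique": False, "nullable": False},
    {"name": "uk_email", "unique": True, "nullable": True},
    {"name": "uk_phone", "unique": True, "nullable": False},
]
chosen = choose_clustered_key(None, indexes)   # falls through to uk_phone
```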

MyISAM non-clustered index

MyISAM uses non-clustered indexes. Its primary key B+ tree and secondary key B+ tree look the same: the node structure is identical and only the stored content differs. Nodes of the primary key index store primary keys, and nodes of the secondary index store secondary keys. The table data is stored separately, and the leaf nodes of both B+ trees hold an address that points to the real table row. From the table data's point of view there is no difference between the two kinds of key. Because the index trees are independent of each other, retrieval by a secondary key does not need to go through the primary key index tree.
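
A minimal model of this layout, assuming a list as the data file and plain dictionaries in place of the two B+ trees (all names and rows are made up for the example):

```python
# Rows live in a separate "data file", in insertion order.
data_file = [(15, "Bob"), (14, "Alice"), (18, "Carol")]

# Both indexes map a key to the row's offset in the data file.
primary_index = {row[0]: offset for offset, row in enumerate(data_file)}
name_index = {row[1]: offset for offset, row in enumerate(data_file)}


def myisam_find_by_id(pk):
    return data_file[primary_index[pk]]


def myisam_find_by_name(name):
    # The secondary index leads straight to the row; the primary index is not consulted.
    return data_file[name_index[name]]
```

Both lookups cost one index search plus one read of the data file, in contrast to the two index searches an InnoDB secondary index needs.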

 

 

Why MySQL indexes use a B+ tree

The trie is ruled out from the start. The AVL tree is undoubtedly the best for queries, but deletions can be a nightmare;

The red-black tree looks like the most suitable candidate. It gives up some query performance, but the rebalancing after a deletion takes only a constant number of rotations (the deletion itself is still O(log n)).

The most important issue, however, is that MySQL data lives in external storage, which makes disk I/O the main performance bottleneck. What we need is a shallower tree, which means a tree with many more branches per node, and a data structure that suits the characteristics of disk operations.

 

A B+ tree is a data structure designed for disks and other direct-access secondary storage devices. MySQL chooses the B+ tree essentially because its data is kept in external storage.
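
To make the depth argument concrete, the sketch below counts how many levels a tree needs to address a given number of keys for a given number of branches per node. The figure of 100 million keys and the fanout of 1000 are illustrative assumptions.

```python
def levels_needed(num_keys, fanout):
    """Smallest number of levels such that fanout ** levels >= num_keys."""
    levels, reachable = 0, 1
    while reachable < num_keys:
        reachable *= fanout
        levels += 1
    return levels


binary_levels = levels_needed(10**8, 2)       # binary tree (AVL, red-black)
multiway_levels = levels_needed(10**8, 1000)  # multiway tree with ~1000 branches per node
```

Under these assumptions the binary tree needs 27 levels and the multiway tree needs 3. If every level costs one disk I/O, that is the difference the paragraph above is pointing at.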

  • B+ tree

       A picture first, before the definition (it really is hard to define precisely in words).

                    

                                                 (an ordinary B+ tree)

     Properties (B+ tree of order m):

  1. Each node in the tree has at most m children.
  2. Every node other than the root and the leaves has at least ceil(m/2) children.
  3. If the root node is not a leaf node, it has at least 2 children.
  4. All leaf nodes appear at the same level.
  5. Each non-terminal node with n keys has the layout (A0, K1, A1, K2, A2, ..., Kn, An). Ki (i = 1...n) are the keys, sorted so that K(i-1) < Ki. Ai is a pointer to the root of a subtree: every key in the subtree pointed to by A(i-1) is less than Ki and greater than K(i-1). The number of keys n must satisfy ceil(m/2) - 1 <= n <= m - 1.

       There is also a variant definition in which a node with n keys has exactly n children; it is not discussed here.
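
Property 5 can be checked mechanically. The sketch below is a small validator for one internal (non-root) node under the definition given here, where n keys come with n + 1 child pointers; the function names are made up for the example.

```python
import math


def key_bounds(m):
    """Minimum and maximum number of keys in a non-root internal node of order m."""
    return math.ceil(m / 2) - 1, m - 1


def is_valid_internal_node(keys, children, m):
    """Check key count, pointer count (A0..An) and strict ascending key order."""
    low, high = key_bounds(m)
    return (
        low <= len(keys) <= high
        and len(children) == len(keys) + 1
        and all(a < b for a, b in zip(keys, keys[1:]))
    )
```

For example, with m = 5 a node must hold between 2 and 4 keys, so is_valid_internal_node([10, 20], ["a", "b", "c"], 5) holds, while a node with a single key or with keys out of order does not.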

    Advantages:

  1. Only leaf nodes hold data; non-leaf (internal) nodes hold only index entries. An internal node stores no row data, only the minimum key of each child subtree as a separator. More keys therefore fit in each node that is read into memory, which reduces the number of I/O operations per search.

  2. It provides stable and efficient range queries, which is why databases and operating system file systems commonly use the B+ tree as their index structure. This works because all leaf nodes are linked to each other in ascending key order.

In a B+ tree only the leaf nodes store data and all other nodes are used purely for indexing, whereas in a B-tree every index node also carries a data field. This is why MySQL (InnoDB) uses the B+ tree for its indexes. An index is usually very large, and in a relational database with a lot of data it can reach the order of 100 million entries, so to limit memory usage the index is stored on disk as well.
So how does MySQL measure query efficiency? By the number of disk I/O operations. B-trees and B+ trees share one trait: each level has many nodes and there are very few levels, which keeps the number of disk I/Os low. But every node of a B-tree also carries a data field (or a pointer to the data), which makes each entry larger. The amount of data read by one disk I/O is fixed, so larger entries mean fewer keys per read, a deeper tree, and more I/Os per lookup. In a B+ tree, nodes other than the leaves store no data, so each entry is small and fewer disk I/Os are needed. This is one of the advantages.
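
A back-of-the-envelope calculation shows how much the data field costs. The 16 KB page size matches the InnoDB page size mentioned earlier; the key, pointer, and row sizes are assumptions chosen only for illustration.

```python
PAGE_SIZE = 16 * 1024   # bytes per page (one node per page)
KEY_SIZE = 8            # assumed: 8-byte integer key
POINTER_SIZE = 6        # assumed: child pointer size
ROW_SIZE = 1024         # assumed: 1 KB of row data per key


def entries_per_page(entry_size):
    """How many fixed-size entries fit in one page."""
    return PAGE_SIZE // entry_size


# B+ tree internal node: key + child pointer only.
bplus_fanout = entries_per_page(KEY_SIZE + POINTER_SIZE)

# B-tree node: key + child pointer + the row data itself.
btree_fanout = entries_per_page(KEY_SIZE + POINTER_SIZE + ROW_SIZE)
```

Under these assumptions one page holds 1170 entries in a B+ tree internal node but only 15 in a B-tree node, so the B-tree needs far more levels, and therefore more I/Os, for the same number of keys.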
The other advantage is that all data fields of a B+ tree sit in the leaf nodes, and as a common optimization the leaf nodes are linked together with pointers. All the data can then be read by walking the leaf nodes, which makes interval access possible. Range queries are very frequent in databases, and a B-tree has no such leaf chain to support this traversal.
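
The leaf chain can be sketched as a linked list of sorted leaves. This is a simplified model: a sorted list of each leaf's first key stands in for the internal nodes, and the Leaf class, build_leaves, and range_scan are names invented for the example.

```python
import bisect


class Leaf:
    """A leaf node: sorted keys plus a pointer to the next leaf."""

    def __init__(self, keys):
        self.keys = keys
        self.next = None


def build_leaves(sorted_keys, per_leaf=4):
    """Pack sorted keys into leaves and link the leaves in ascending order."""
    leaves = [Leaf(sorted_keys[i:i + per_leaf])
              for i in range(0, len(sorted_keys), per_leaf)]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves


def range_scan(leaves, low, high):
    """Return all keys k with low <= k <= high by walking the leaf chain."""
    if not leaves:
        return []
    # The first key of each leaf stands in for the internal index nodes.
    first_keys = [leaf.keys[0] for leaf in leaves]
    start = max(bisect.bisect_right(first_keys, low) - 1, 0)
    result = []
    leaf = leaves[start]
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return result          # past the range: stop early
            if key >= low:
                result.append(key)
        leaf = leaf.next               # follow the link instead of re-descending the tree
    return result


leaves = build_leaves(list(range(0, 100, 5)))
found = range_scan(leaves, 12, 33)
```

One descent locates the starting leaf; after that the scan only follows next pointers, so a range of any length never goes back up the tree.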

The difference between B-tree and red-black tree

AVL trees and red-black trees are basically used only for data held in memory. With large-scale data storage, a red-black tree tends to become too deep, which causes disk reads and writes to happen too frequently and leads to low efficiency. Why? To get data from a disk, the arm first has to move to the cylinder that holds the data, then the right disk surface is selected, then the platter rotates until the data passes under the head, and only then is the data read or written. Most of the cost of a disk I/O goes into seeking to the required cylinder, and a deep tree means many such I/Os. In other words, the number of disk seeks and accesses is largely determined by the height of the tree, so the goal is a tree structure that keeps the height as small as possible. A B-tree node can have many children, from dozens up to thousands, which is what lowers the height of the tree.

The designers of database systems cleverly exploit disk read-ahead by making the size of a node equal to one page, so that each node can be loaded completely with a single I/O. To achieve this, a practical B-Tree implementation uses the following technique: whenever a new node is created, it requests a whole page of space. This guarantees that a node is physically stored within one page, and since storage allocation is page-aligned, reading a node takes only one I/O.
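
The "one node, one page" idea can be sketched with an ordinary file: every node is padded to a full page, so node number n always starts at byte offset n * PAGE_SIZE and can be fetched with a single read. The 16 KB page size, the file name, and the helper functions are assumptions for the illustration, and payloads are assumed to fit in one page.

```python
import os
import tempfile

PAGE_SIZE = 16 * 1024   # assumed page size in bytes


def write_pages(path, payloads):
    """Write each node padded to a full page (payloads must fit in one page)."""
    with open(path, "wb") as f:
        for payload in payloads:
            f.write(payload.ljust(PAGE_SIZE, b"\x00"))


def read_page(path, page_no):
    """Fetch one node with a single seek and a single page-sized read."""
    with open(path, "rb") as f:
        f.seek(page_no * PAGE_SIZE)
        return f.read(PAGE_SIZE).rstrip(b"\x00")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.bin")
    write_pages(path, [b"root", b"leaf-1", b"leaf-2"])
    node = read_page(path, 2)
    file_size = os.path.getsize(path)
```

Because every node occupies exactly one aligned page, locating a node is pure arithmetic on its page number and never requires reading a neighbouring page.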


 

 

