Why doesn't InnoDB use a balanced binary tree?
Computer Storage Hierarchy
Computer storage devices are generally divided into two types: main memory and external memory.
Main memory is internal memory: it has fast access speed, but it is small, expensive, and cannot keep data for long (data is lost when the power is off).
External memory is typically a disk. Reading from a disk relies on mechanical movement, and the time for each read has three parts: seek time, rotational delay, and transfer time. Seek time is the time the arm needs to move to the target track, generally under 5 ms on mainstream disks. Rotational delay depends on the disk speed we often hear quoted: a 7200 RPM disk rotates 7200 times per minute, or 120 times per second, so the average rotational delay is half a revolution, 1/120/2 s = 4.17 ms. Transfer time is the time to read data from or write data to the disk, generally a few tenths of a millisecond, which is negligible compared with the other two. So one disk access, that is, one disk I/O, takes about 5 + 4.17 ≈ 9 ms. That sounds acceptable, but a 500 MIPS machine executes 500 million instructions per second, because instruction execution is electronic rather than mechanical. In other words, about 4.5 million instructions can be executed in the time of a single I/O. A read query may need several disk accesses at 9 ms each, which is obviously a disaster.
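The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 5 ms seek time and 7200 RPM come from the text, the CPU speed is assumed to be 500 MIPS (500 million instructions per second), and the transfer time is ignored.

```python
# Back-of-the-envelope cost of one random disk I/O.
SEEK_MS = 5.0   # typical seek time quoted in the text
RPM = 7200      # disk rotation speed
MIPS = 500      # assumed CPU speed: 500 million instructions per second

revolutions_per_second = RPM / 60                       # 120
rotation_delay_ms = 1000 / revolutions_per_second / 2   # half a revolution on average
io_ms = SEEK_MS + rotation_delay_ms                     # transfer time is ignored
instructions_per_io = MIPS * 1_000_000 * io_ms / 1000

print(f"rotation delay: {rotation_delay_ms:.2f} ms")
print(f"one disk I/O:   {io_ms:.2f} ms")
print(f"instructions executed during one I/O: {instructions_per_io:,.0f}")
```

Without rounding the I/O time down to 9 ms, the result is slightly above the 4.5 million instructions quoted in the text.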
The access times of different media are worth knowing. The figures below come from Jeff Dean of Google.
Operation | Latency |
---|---|
L1 cache reference | 0.5 ns |
Branch mispredict | 5 ns |
L2 cache reference | 7 ns |
Mutex lock/unlock | 100 ns |
Main memory reference | 100 ns |
Compress 1K bytes with Zippy | 10,000 ns |
Send 2K bytes over 1 Gbps network | 20,000 ns |
Read 1 MB sequentially from memory | 250,000 ns |
Round trip within same datacenter | 500,000 ns |
Disk seek | 10,000,000 ns |
Read 1 MB sequentially from network | 10,000,000 ns |
Read 1 MB sequentially from disk | 30,000,000 ns |
Send packet CA->Netherlands->CA | 150,000,000 ns |
Because disk I/O is so expensive, the operating system makes some optimizations. When an I/O is performed, it reads not only the data at the requested disk address but also the adjacent data into a memory buffer, because the principle of locality tells us that when a computer accesses data at one address, the adjacent data is likely to be accessed soon as well. The data read by one I/O is called a page. The page size depends on the operating system and is generally 4 KB or 8 KB; reading any data within one page costs a single I/O. This is very helpful for the data structure design of indexes.
Assume that a table has 1023 records. A balanced binary tree storing them has a height of 10, so accessing one row requires reading 10 disk blocks. If one random read of a disk block takes 10 ms, accessing a row takes 10 × 10 ms = 100 ms. To go faster, we must touch as few disk blocks as possible. A balanced n-ary search tree with the same number of nodes N has a height of about log_n(N), so to reduce the number of disk accesses, n must be increased. With a 10-ary search tree, this table needs only 3 to 4 random disk accesses.
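The effect of the fan-out on tree height can be sketched as follows. The helper `min_height` is hypothetical and assumes a simple model in which every node of an n-ary search tree holds n - 1 keys, so a full tree with h levels holds n^h - 1 keys; the 10 ms per block read is the figure used above.

```python
def min_height(num_keys: int, fanout: int) -> int:
    """Minimum number of levels of a fanout-ary search tree holding num_keys keys.

    Assumed model: every node holds fanout - 1 keys, so a full tree with
    h levels holds fanout**h - 1 keys.
    """
    height = 0
    capacity = 0  # keys that fit into `height` levels
    while capacity < num_keys:
        height += 1
        capacity = fanout ** height - 1
    return height

READ_MS = 10  # one random disk block read, as assumed in the text

for fanout in (2, 10, 100):
    levels = min_height(1023, fanout)
    print(f"fan-out {fanout:>3}: {levels} levels, about {levels * READ_MS} ms per lookup")
```

Under this model the 1023-record table needs 10 levels as a binary tree and 4 levels as a 10-ary tree, at the upper end of the range given in the text.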
The example table used in this article:
CREATE TABLE `test` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` char(10) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
With 500 million records and a 20 GB tablespace, the B+ tree of the test table has only 3 levels:
➜ mysql git:(stable) innodb_space -s ibdata1 -T identity/test -I PRIMARY -l 0 index-level-summary | wc -l
1146555
➜ mysql git:(stable) innodb_space -s ibdata1 -T identity/test -I PRIMARY -l 1 index-level-summary | wc -l
1093
➜ mysql git:(stable) innodb_space -s ibdata1 -T identity/test -I PRIMARY -l 2 index-level-summary
page index level data free records min_key
3 97 2 15288 418 1092 id=1
Applying the formula above gives the maximum number of records that a B+ tree with 1 to 4 levels can hold (data source: https://blog.jcole.us/2013/01/10/btree-index-structures-in-innodb/):
Height | Non-leaf pages | Leaf pages | Rows | Size in bytes |
---|---|---|---|---|
1 | 0 | 1 | 468 | 16.0 KiB |
2 | 1 | 1203 | > 563 thousand | 18.8 MiB |
3 | 1204 | 1447209 | > 677 million | 22.1 GiB |
4 | 1448413 | 1740992427 | > 814 billion | 25.9 TiB |
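The rows of this table can be reproduced from two per-page figures that the table itself implies: 468 records per leaf page (the height 1 row) and 1203 child pointers per non-leaf page (the height 2 row). The sketch below assumes every page is completely full and uses the default 16 KiB page size; `btree_capacity` is a name chosen for illustration.

```python
RECORDS_PER_LEAF_PAGE = 468        # rows per leaf page, from the height 1 row
RECORDS_PER_NON_LEAF_PAGE = 1203   # child pointers per non-leaf page, from the height 2 row
PAGE_SIZE = 16 * 1024              # default InnoDB page size in bytes

def btree_capacity(height: int):
    """Return (non_leaf_pages, leaf_pages, rows, size_bytes) for a completely full tree."""
    leaf_pages = RECORDS_PER_NON_LEAF_PAGE ** (height - 1)
    non_leaf_pages = sum(RECORDS_PER_NON_LEAF_PAGE ** level for level in range(height - 1))
    rows = leaf_pages * RECORDS_PER_LEAF_PAGE
    size_bytes = (leaf_pages + non_leaf_pages) * PAGE_SIZE
    return non_leaf_pages, leaf_pages, rows, size_bytes

for height in range(1, 5):
    non_leaf, leaf, rows, size = btree_capacity(height)
    print(f"height {height}: {non_leaf} non-leaf pages, {leaf} leaf pages, "
          f"{rows:,} rows, {size / 2**20:,.1f} MiB")
```

Each extra level multiplies the number of leaf pages by 1203, which is why a 3-level tree is already enough for several hundred million rows.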
Two B+ tree data structure
A B+ tree is a data structure: an N-ary sorted tree in which each node usually has multiple children. A B+ tree contains a root node, internal nodes, and leaf nodes. The root node may be a leaf node, or a node with two or more child nodes.
B+ trees are commonly used in databases and in operating system file systems. File systems such as NTFS, ReiserFS, NSS, XFS, JFS, ReFS, and BFS all use B+ trees for metadata indexing. A B+ tree keeps data stable and ordered, and its insertions and modifications have stable logarithmic time complexity. Elements are inserted into a B+ tree bottom-up.
Definition of B+ tree
A B+ tree is a variant of the B-tree that arose from the needs of file systems. The differences between a B+ tree of order m and a B-tree of order m are:
1) A node with n subtrees contains n keywords (that is, each keyword corresponds to a subtree);
2) All leaf nodes contain information of all keywords, and pointers to records containing these keywords, and the leaf nodes themselves are linked in order from small to large according to the size of keywords;
3) All non-terminal nodes can be regarded as the index part; these nodes contain only the largest (or smallest) key of each of their subtrees;
4) Except for the root node, every node must contain at least ⌈m/2⌉ keywords (note: in a B-tree, all non-terminal nodes except the root have at least ⌈m/2⌉ subtrees).
The figure above shows a 3-level B+ tree. A B+ tree usually has two head pointers, one pointing to the root node and the other pointing to the leaf node with the smallest key. Therefore, two kinds of search can be performed on a B+ tree: a sequential search that starts from the smallest key, and a random search that starts from the root node.
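The two kinds of search can be illustrated with a small in-memory sketch. It follows the definition above (a node with n subtrees stores n keys, each being the largest key of one subtree) and bulk-loads the tree from sorted data instead of implementing insertion and splitting. The names `Node`, `build`, `search`, and `scan` are invented for this example, and `build` assumes a non-empty, sorted list of (key, value) pairs.

```python
from bisect import bisect_left

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # leaf: record keys; internal: largest key of each child
        self.children = children  # internal nodes only
        self.values = values      # leaf nodes only
        self.next = None          # leaf nodes only: next leaf in key order

def build(items, order=3):
    """Bulk-load a B+ tree from sorted (key, value) pairs; returns (root, first_leaf)."""
    leaves = []
    for i in range(0, len(items), order):
        chunk = items[i:i + order]
        leaves.append(Node([k for k, _ in chunk], values=[v for _, v in chunk]))
    for left, right in zip(leaves, leaves[1:]):
        left.next = right                     # link the leaves from small to large
    level = leaves
    while len(level) > 1:                     # build index levels bottom-up
        parents = []
        for i in range(0, len(level), order):
            group = level[i:i + order]
            parents.append(Node([child.keys[-1] for child in group], children=group))
        level = parents
    return level[0], leaves[0]

def search(root, key):
    """Random search: descend from the root to a single leaf."""
    node = root
    while node.children is not None:
        i = bisect_left(node.keys, key)       # first child whose largest key is >= key
        if i == len(node.keys):
            return None                       # key is larger than every key in the tree
        node = node.children[i]
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None

def scan(first_leaf):
    """Sequential search: walk the linked leaves starting from the smallest key."""
    leaf = first_leaf
    while leaf is not None:
        yield from zip(leaf.keys, leaf.values)
        leaf = leaf.next

root, first_leaf = build([(k, k * k) for k in range(1, 11)])
print(search(root, 7))
print([k for k, _ in scan(first_leaf)])
```

The random search touches one node per level, while the sequential scan never visits the index levels at all, which is what makes range queries cheap once the starting leaf is known.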
innodb_space -s ibdata1 -T identity/identity_app_token -I PRIMARY -l 0 index-level-summary | wc -l
innodb_space -s ibdata1 -T identity/test -I PRIMARY -l 2 index-level-summary
Three Page structure
A page is the smallest unit of disk management in the InnoDB storage engine, and each page is 16 KB by default. It is also the smallest unit of disk reads: even if only one row is needed, the whole page containing it must be loaded. Overview of the basic page structure:
File header data structure:
From the page number you can calculate the maximum physical space that a table can occupy: the page number is a 4-byte field, so with 16 KB pages a tablespace can hold at most 2^32 × 16 KB = 64 TB. The previous-page and next-page pointers link the pages of the same level into a doubly linked list, which makes range searches possible.
User record data structure:
All records are stored logically as a singly linked list.
Page Directory: an array of auxiliary pointers into the linked list of User Records. Records are divided into groups of 4 to 8 consecutive records, and each slot in the page directory points to the offset of the last record of its group.
This trades space for time and makes binary search possible on a singly linked list.
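The idea of binary search over the page directory followed by a short walk along the linked list can be sketched as follows. This is a simplified model, not InnoDB's actual page format: it uses Python objects instead of byte offsets, a fixed group size of 4, and leaves out the infimum and supremum pseudo-records that real pages contain. `Record`, `build_page`, and `find` are illustrative names, and `build_page` assumes at least one key.

```python
class Record:
    def __init__(self, key):
        self.key = key
        self.next = None   # singly linked list in key order

def build_page(keys, group_size=4):
    """Link the records and build a directory of the last record of each group."""
    records = [Record(k) for k in sorted(keys)]
    for a, b in zip(records, records[1:]):
        a.next = b
    slots = [records[min(i + group_size, len(records)) - 1]
             for i in range(0, len(records), group_size)]
    return records[0], slots

def find(first_record, slots, key):
    # Binary search the directory for the first group whose last key is >= key.
    lo, hi = 0, len(slots)
    while lo < hi:
        mid = (lo + hi) // 2
        if slots[mid].key < key:
            lo = mid + 1
        else:
            hi = mid
    if lo == len(slots):
        return None                     # key is larger than every record in the page
    # Walk the linked list inside that one group only.
    record = first_record if lo == 0 else slots[lo - 1].next
    while record is not slots[lo].next:
        if record.key == key:
            return record
        record = record.next
    return None

first_record, slots = build_page(range(2, 42, 2))   # keys 2, 4, ..., 40
print([slot.key for slot in slots])                 # last key of each group
print(find(first_record, slots, 18) is not None)    # key present in the page
print(find(first_record, slots, 19) is not None)    # key absent from the page
```

The directory costs one extra pointer per group, and in return a lookup inspects only a logarithmic number of slots plus at most one group of records.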
innodb_space -s ibdata1 -T identity/identity_app_token -p 3 page-records
innodb_space -s ibdata1 -T identity/identity_app_token -p 3 page-dump | more
innodb_space -s ibdata1 -T identity/identity_app_token -p 3 page-illustrate
A simple SQL search process
select *
from identity_app_token
where id = 787123
Four Clustered Index & Secondary Index
In InnoDB, a clustered index is created on the primary key by default when a table is created. This B+ tree organizes all the data in the table; that is, the rows are stored in the index itself, ordered by primary key. So in InnoDB the primary key index is also called the clustered index, and its leaf nodes store entire rows. All indexes other than the clustered index are called secondary indexes, and the leaf nodes of a secondary index store the primary key value.
- Clustered index (primary index)
  - A clustered index sorts and stores the data rows of a table or view according to their key values. The index definition contains the clustered index columns. There can be only one clustered index per table, because the data rows themselves can be stored in only one order.
  - Data rows in a table are stored in sorted order only if the table has a clustered index. A table with a clustered index is called a clustered table. If a table has no clustered index, its data rows are stored in an unordered structure called a heap.
- Nonclustered index (also known as secondary index)
  - A nonclustered index has a structure that is separate from the data rows. It contains the nonclustered index key values, and each key value entry has a pointer to the data row that contains that key value.
  - All indexes except the primary key index are nonclustered indexes.
If the leaf nodes of a B+ tree store the index key together with the row record, the index is a clustered index; if the index key and the row record are stored separately, it is a nonclustered index.
id (primary key, clustered index) | Name (key, non-clustered index) | Company |
---|---|---|
5 | Gates | Microsoft |
7 | Bezos | Amazon |
11 | Jobs | Apple |
14 | Ellison | Oracle |
How a secondary index lookup works: if the secondary index alone satisfies the query, the result is returned directly, and the index is then a covering index. Otherwise, the primary key found in the secondary index is used for a second lookup in the primary key (clustered) index, which is called going back to the table.
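The lookup path can be modelled with the four example rows from the table above. In this sketch a dictionary stands in for the clustered index and a sorted list of (name, primary key) pairs stands in for the secondary index on the name column; `lookup_by_name` is a hypothetical helper that also reports how many indexes it had to consult.

```python
from bisect import bisect_left

# Clustered index: primary key -> full row (the leaf nodes hold the whole row).
clustered = {
    5: {"id": 5, "name": "Gates", "company": "Microsoft"},
    7: {"id": 7, "name": "Bezos", "company": "Amazon"},
    11: {"id": 11, "name": "Jobs", "company": "Apple"},
    14: {"id": 14, "name": "Ellison", "company": "Oracle"},
}

# Secondary index on name: sorted (name, primary key) entries.
secondary = sorted((row["name"], pk) for pk, row in clustered.items())

def lookup_by_name(name, columns):
    """Return (values, indexes_consulted), or None if the name is not found."""
    i = bisect_left(secondary, (name,))
    if i == len(secondary) or secondary[i][0] != name:
        return None
    _, pk = secondary[i]
    entry = {"name": name, "id": pk}
    if set(columns) <= set(entry):
        # Covering index: every requested column is in the secondary index entry.
        return {c: entry[c] for c in columns}, 1
    row = clustered[pk]                 # back to the table via the primary key
    return {c: row[c] for c in columns}, 2

print(lookup_by_name("Jobs", ["id"]))        # answered by the secondary index alone
print(lookup_by_name("Jobs", ["company"]))   # needs a second lookup in the clustered index
```

Selecting only the indexed column and the primary key stays within one index, while asking for any other column costs a second traversal.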
- Each secondary index is a complete B+ tree, so consider the disk space cost when adding indexes, and avoid adding too many indexes or indexes that are too large (for example, do not index a long varchar column in full).
- To preserve ordering, insert, update, and delete statements may cause page splits, which degrades performance. The more secondary indexes there are, the greater the impact, so building too many secondary indexes is discouraged.
- With a composite index on (A, B, C), queries can use the prefixes (A), (A, B), and (A, B, C).
- like '%search_key' cannot use the index, while like 'search_key%' can.
- Because a secondary index may require going back to the table, use the primary key index whenever you can.
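The prefix rule for composite indexes follows from sort order, which a sorted list of tuples can illustrate. This is only a model of an index on three hypothetical columns (a, b, c) with made-up values: a binary search can locate entries by their leading columns, but a condition on b alone has to examine every entry.

```python
from bisect import bisect_left

# A composite index on (a, b, c), modelled as a list of tuples kept in sorted order.
index = sorted([
    (1, "x", 10), (1, "y", 20), (2, "x", 30), (2, "y", 40), (3, "x", 50),
])

def prefix_range(prefix):
    """Entries whose leading columns equal `prefix`, located with a binary search."""
    start = bisect_left(index, prefix)
    matches = []
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break                      # sorted order guarantees no later match
        matches.append(entry)
    return matches

def scan_by_b(b):
    """A condition on b alone cannot use the ordering, so every entry is examined."""
    return [entry for entry in index if entry[1] == b]

print(prefix_range((2,)))        # uses the (a) prefix
print(prefix_range((2, "y")))    # uses the (a, b) prefix
print(scan_by_b("x"))            # full scan
```

The same reasoning explains the like rule: a pattern with a fixed beginning selects one contiguous range of the sorted keys, while a pattern with a leading wildcard does not.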
References
- https://blog.jcole.us/innodb/ InnoDB blog series by employee No. 14 of MySQL AB
- https://github.com/jeremycole/innodb_ruby tool for analyzing InnoDB's underlying storage
- High Performance MySQL
- Inside MySQL: The InnoDB Storage Engine (Second Edition)