Exploring the underlying storage structure of InnoDB

One Why doesn't InnoDB use a balanced binary tree?

Computer Storage Hierarchy

Computer storage devices are generally divided into two types: main memory and external memory.

Main memory is the internal memory: it is fast to access, but it is small, expensive, and cannot keep data for long (data is lost when the power goes off).

External memory is typically a disk. Reading from a disk relies on mechanical movement, and the time for each read has three parts: seek time, rotational delay, and transfer time. Seek time is the time the arm needs to move to the target track, generally under 5 ms on mainstream disks. Rotational delay follows from the rotational speed we often hear quoted: a 7200 rpm disk rotates 7200 times per minute, or 120 times per second, so the average rotational delay is half a revolution, 1/120/2 s = 4.17 ms. Transfer time is the time to actually read data from or write data to the disk, generally a few tenths of a millisecond, which is negligible next to the other two. One disk access, that is, one disk IO, therefore takes about 5 + 4.17 ≈ 9 ms. That sounds acceptable, but a 500 MIPS machine executes 500 million instructions per second, because executing instructions is electronic rather than mechanical. In other words, about 4.5 million instructions can run in the time of a single IO. A read query that needs several disk accesses, at 9 ms each, is obviously a disaster.
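The arithmetic above can be checked with a short calculation. This is only a back-of-the-envelope sketch: the constants are the figures quoted in the text, and the variable names are made up for illustration.

```python
# Back-of-the-envelope cost of one random disk IO, using the figures from the text.
SEEK_MS = 5.0   # seek time quoted above
RPM = 7200      # rotational speed of the example disk
MIPS = 500      # CPU speed: 500 million instructions per second

revolutions_per_second = RPM / 60
# On average the platter must turn half a revolution to reach the target sector.
rotational_delay_ms = 1000 / revolutions_per_second / 2

# Transfer time is ignored because it is negligible next to the other two parts.
io_ms = SEEK_MS + rotational_delay_ms
instructions_per_io = MIPS * 1_000_000 * io_ms / 1000

print(f"rotational delay: {rotational_delay_ms:.2f} ms")
print(f"one random IO:    {io_ms:.2f} ms")
print(f"instructions executed in that time: {instructions_per_io:,.0f}")
```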

The access times of different storage media are worth knowing. The figures below come from Jeff Dean of Google.

Operation                              Latency
L1 cache reference                     0.5 ns
Branch mispredict                      5 ns
L2 cache reference                     7 ns
Mutex lock/unlock                      100 ns
Main memory reference                  100 ns
Compress 1K bytes with Zippy           10,000 ns
Send 2K bytes over 1 Gbps network      20,000 ns
Read 1 MB sequentially from memory     250,000 ns
Round trip within same datacenter      500,000 ns
Disk seek                              10,000,000 ns
Read 1 MB sequentially from network    10,000,000 ns
Read 1 MB sequentially from disk       30,000,000 ns
Send packet CA->Netherlands->CA        150,000,000 ns

Because disk IO is so expensive, the operating system optimizes it. On each IO it reads not only the data at the requested disk address but also the adjacent data into a memory buffer, because the principle of locality tells us that when a program accesses data at one address, the data next to it is likely to be accessed soon. The data read by one IO is called a page. The page size depends on the operating system and is generally 4 KB or 8 KB, so reading any data within one page costs exactly one IO. This observation is very helpful when designing the data structure of an index.

Assume a table has 1023 records stored in a balanced binary tree, which then has a height of 10. Accessing one row means reading up to 10 disk blocks. If one random block read takes 10 ms, accessing a row takes 10 × 10 ms = 100 ms. To go faster, we must touch as few disk blocks as possible. A balanced n-ary search tree with the same number of nodes N has a height of about log_n(N), so reducing the number of disk accesses means increasing n. With a 10-ary search tree, this table needs only 3 to 4 random disk accesses.
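The effect of the fan-out n on tree height can be sketched as follows. This assumes a full n-ary search tree in which every node stores n - 1 keys, so h levels hold n^h - 1 keys; the function name `min_height` is made up for this illustration.

```python
def min_height(num_keys: int, fanout: int) -> int:
    """Levels needed by a full n-ary search tree (fanout - 1 keys per node,
    so h levels hold fanout**h - 1 keys) to store num_keys keys."""
    height, capacity = 0, 0
    while capacity < num_keys:
        height += 1
        capacity = fanout ** height - 1
    return height

IO_MS = 10  # cost of one random block read, as assumed in the text

for fanout in (2, 10, 100, 1000):
    h = min_height(1023, fanout)
    print(f"fanout {fanout:>4}: height {h:>2}, about {h * IO_MS} ms per lookup")
```

Integer arithmetic is used instead of a floating-point logarithm to avoid rounding surprises at exact powers.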


This article uses the following table as its running example:

CREATE TABLE `test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` char(10) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

The test table, with 500 million records and a 20 GB tablespace, has a B+ tree of only 3 levels:

➜  mysql git:(stable) innodb_space -s ibdata1 -T  identity/test -I PRIMARY -l 0 index-level-summary | wc -l
 1146555
➜  mysql git:(stable) innodb_space -s ibdata1 -T  identity/test -I PRIMARY -l 1 index-level-summary | wc -l
    1093
➜  mysql git:(stable) innodb_space -s ibdata1 -T  identity/test -I PRIMARY -l 2 index-level-summary
page    index   level   data    free    records min_key
3       97      2       15288   418     1092    id=1

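Applying the formula above to these numbers (wc -l also counts the header line, so level 1 has 1092 pages and level 0 has 1146554 pages): the root page holds 1092 records, one per level-1 page, so the fan-out is roughly 1092. Two non-leaf levels can therefore address about 1092 × 1092 ≈ 1.19 million leaf pages, which is enough for the 1146554 leaf pages counted above, giving a tree of height 3.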

The maximum number of records that a B+ tree of height 1 to 4 can hold (data source: https://blog.jcole.us/2013/01/10/btree-index-structures-in-innodb/):

Height  Non-leaf pages  Leaf pages   Rows            Size in bytes
1       0               1            468             16.0 KiB
2       1               1203         > 563 thousand  18.8 MiB
3       1204            1447209      > 677 million   22.1 GiB
4       1448413         1740992427   > 814 billion   25.9 TiB
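The page and row counts in this table can be reproduced with a few lines of arithmetic. This sketch assumes 468 rows per leaf page and 1203 child pointers per non-leaf page, both read off the height-1 and height-2 rows of the table; the function name `capacity` is made up for illustration.

```python
ROWS_PER_LEAF = 468   # rows in one leaf page (height-1 row of the table)
FANOUT = 1203         # child pointers per non-leaf page (height-2 row of the table)
PAGE_KIB = 16         # default InnoDB page size in KiB

def capacity(height: int):
    """Return (non_leaf_pages, leaf_pages, rows, size_kib) for a completely full tree."""
    leaf_pages = FANOUT ** (height - 1)
    non_leaf_pages = sum(FANOUT ** level for level in range(height - 1))
    rows = leaf_pages * ROWS_PER_LEAF
    size_kib = (leaf_pages + non_leaf_pages) * PAGE_KIB
    return non_leaf_pages, leaf_pages, rows, size_kib

for h in range(1, 5):
    non_leaf, leaf, rows, size_kib = capacity(h)
    print(f"height {h}: {non_leaf} non-leaf pages, {leaf} leaf pages, "
          f"{rows:,} rows, {size_kib:,} KiB")
```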

Two B+ tree data structure

A B+ tree is a data structure: an N-ary sorted tree in which each node usually has many children. A B+ tree consists of a root node, internal nodes, and leaf nodes. The root may itself be a leaf, or it may be a node with two or more children.

B+ trees are commonly used in databases and in operating-system file systems. File systems such as NTFS, ReiserFS, NSS, XFS, JFS, ReFS, and BFS all use B+ trees for their metadata indexes. A B+ tree keeps data stably sorted, and its insertions and modifications have a stable logarithmic time complexity. Elements are inserted bottom-up.

[Figure: a three-level B+ tree]

Definition of B+ tree

The B+ tree is a variant of the B-tree that grew out of the needs of file systems. An order-m B+ tree differs from an order-m B-tree in the following ways:

1) A node with n subtrees contains n keys (that is, each key corresponds to one subtree);

2) The leaf nodes together contain all the keys, along with pointers to the records holding those keys, and the leaf nodes are linked to one another in ascending key order;

3) All non-terminal nodes can be regarded as the index part: each holds only the largest (or smallest) key of each of its subtrees (that is, of the subtree's root node);

4) Every node other than the root must contain at least ⌈m/2⌉ keys (note: in a B-tree, every non-terminal node except the root has at least ⌈m/2⌉ subtrees).

The figure above shows a three-level B+ tree. A B+ tree usually has two head pointers: one points to the root node and the other points to the leaf node with the smallest key. Two kinds of search are therefore possible: a sequential search that starts from the smallest key, and a random search that starts from the root node.
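The two search modes can be illustrated with a toy in-memory tree. This is a simplified sketch, not InnoDB's on-disk layout: the tree is bulk-built from a non-empty sorted list, internal nodes keep the smallest key of each child after the first as a separator, and all class and function names are made up for the example.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys   # sorted keys stored in this leaf
        self.next = None   # link to the next leaf in key order

class Internal:
    def __init__(self, children):
        self.children = children
        # separators[i] is the smallest key reachable under children[i + 1]
        self.separators = [smallest_key(c) for c in children[1:]]

def smallest_key(node):
    while isinstance(node, Internal):
        node = node.children[0]
    return node.keys[0]

def build(sorted_keys, fanout=3):
    """Bulk-build a tree bottom-up from a non-empty sorted list.
    Returns the two head pointers: (root, first_leaf)."""
    leaves = [Leaf(sorted_keys[i:i + fanout])
              for i in range(0, len(sorted_keys), fanout)]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    level = leaves
    while len(level) > 1:
        level = [Internal(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0], leaves[0]

def lookup(root, key):
    """Random search: descend from the root to the one leaf that may hold key."""
    node = root
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.separators, key)]
    return key in node.keys

def scan(first_leaf):
    """Sequential search: follow the leaf links starting from the smallest key."""
    leaf = first_leaf
    while leaf is not None:
        yield from leaf.keys
        leaf = leaf.next

root, first_leaf = build(list(range(1, 28)))   # 27 keys, fanout 3: three levels
print(lookup(root, 14), lookup(root, 28))
print(list(scan(first_leaf))[:5])
```

A point lookup touches one node per level, while a full scan never revisits the internal nodes at all.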

innodb_space -s ibdata1 -T  identity/identity_app_token -I PRIMARY -l 0 index-level-summary | wc -l
innodb_space -s ibdata1 -T  identity/test -I PRIMARY -l 2 index-level-summary

Three Page structure

A page is the smallest unit of disk management in the InnoDB storage engine, and the default page size is 16 KB. It is also the smallest unit of disk reads: even if only one row is needed, the whole page that contains it must be loaded.

[Figure: overview of the basic page structure]

[Figure: file header data structure]

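The file header stores the page number together with pointers to the previous and next pages. From the width of the page number you can calculate the maximum physical space a table can occupy: the field is 32 bits wide, so a tablespace holds at most 2^32 pages, which at 16 KB per page is 64 TB. The previous and next pointers link the pages of the same level into a doubly linked list, which is what makes range scans possible.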

[Figure: user record data structure]

Within a page, all records are logically organized as a singly linked list.

Page Directory: an array of auxiliary pointers into the linked list of User Records. The records are divided into groups of 4 to 8 consecutive records, and each slot in the page directory points to the offset of the last record of its group.

This trades space for time: it makes binary search possible over a singly linked list.
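The idea can be sketched as follows. This is a simplified model rather than the real page format: it uses a fixed group size, ignores the special infimum and supremum records, assumes a non-empty page, and uses Python object references where InnoDB uses byte offsets. All names are made up for the example.

```python
class Record:
    def __init__(self, key):
        self.key = key
        self.next = None   # singly linked: the next record in key order

def build_page(keys, group_size=4):
    """Link sorted keys into a list and build a directory of group-ending records."""
    records = [Record(k) for k in keys]
    for a, b in zip(records, records[1:]):
        a.next = b
    # Each slot points at the last record of its group.
    slots = [records[min(i + group_size, len(records)) - 1]
             for i in range(0, len(records), group_size)]
    return records[0], slots

def find(head, slots, key):
    """Binary-search the slots, then walk at most one group of the linked list."""
    lo, hi = 0, len(slots)
    while lo < hi:                      # find the first slot whose last key >= key
        mid = (lo + hi) // 2
        if slots[mid].key < key:
            lo = mid + 1
        else:
            hi = mid
    if lo == len(slots):
        return None                     # key is larger than every record in the page
    # The group starts right after the record that the previous slot points to.
    node = head if lo == 0 else slots[lo - 1].next
    while node is not None and node.key <= slots[lo].key:
        if node.key == key:
            return node
        node = node.next
    return None

head, slots = build_page(list(range(10, 110, 10)))   # keys 10, 20, ..., 100
print([s.key for s in slots])
print(find(head, slots, 70).key, find(head, slots, 75))
```

The binary search runs over the small slot array only, so the linked list is never walked for more than one group.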

innodb_space -s ibdata1 -T  identity/identity_app_token -p 3 page-records
innodb_space -s ibdata1 -T  identity/identity_app_token -p 3 page-dump | more
innodb_space -s ibdata1 -T  identity/identity_app_token -p 3 page-illustrate


A simple SQL search process

select *
from identity_app_token
where id = 787123

Four Clustered Index & Secondary Index

In MySQL, when a table is created, a clustered index is built on the primary key by default. This B+ tree organizes all the data in the table, that is, the table data itself lives in the primary key index. In InnoDB the primary key index is therefore also called the clustered index, and its leaf nodes store entire rows. Every index other than the clustered index is called a secondary index, and the leaf nodes of a secondary index store the primary key value.

  • Clustered (primary index)

    • A clustered index sorts and stores rows of data in a table or view according to their key values. The index definition contains clustered index columns. There can only be one clustered index per table because the data rows themselves can only be stored in one order.
    • Data rows in a table are stored in sorted order only if the table contains a clustered index. A table is called a clustered table if it has a clustered index. If a table does not have a clustered index, its data rows are stored in an unordered structure called a heap.
  • Nonclustered (also known as secondary index)

    • A nonclustered index has a structure that is independent of the data rows. A nonclustered index contains nonclustered index key values, and each key value entry has a pointer to the data row containing that key value.
    • All indexes other than the primary key index are nonclustered indexes.

    If the leaf nodes of the B+ tree store the index key together with the row record, the index is clustered; if the index key and the row record are stored separately, the index is nonclustered.


id (primary key, clustered index)  Name (key, non-clustered index)  Company
5                                  Gates                            Microsoft
7                                  Bezos                            Amazon
11                                 Jobs                             Apple
14                                 Ellison                          Oracle

How a secondary index lookup works: if the secondary index alone contains everything the query needs, the result is returned directly, and in that case the index is a covering index. Otherwise the lookup has to go back to the table, using the primary key value to run a second lookup in the primary key (clustered) index.
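The difference between a covering lookup and a lookup that goes back to the table can be sketched with the example rows above. This is only a model: plain dictionaries stand in for the two B+ trees, and the names `clustered`, `name_index`, and `select` are made up for the illustration.

```python
# Clustered index: primary key -> full row (the leaf nodes hold the whole row).
clustered = {
    5:  {"id": 5,  "name": "Gates",   "company": "Microsoft"},
    7:  {"id": 7,  "name": "Bezos",   "company": "Amazon"},
    11: {"id": 11, "name": "Jobs",    "company": "Apple"},
    14: {"id": 14, "name": "Ellison", "company": "Oracle"},
}
# Secondary index on name: key value -> primary key (the leaf nodes hold only the id).
name_index = {row["name"]: pk for pk, row in clustered.items()}

def select(columns, name):
    """Model SELECT columns ... WHERE name = ?.
    Returns (row, number_of_lookups_in_the_clustered_index)."""
    pk = name_index.get(name)
    if pk is None:
        return None, 0
    in_index = {"id": pk, "name": name}
    if set(columns) <= set(in_index):
        # Covering index: every requested column is already in the secondary index.
        return {c: in_index[c] for c in columns}, 0
    # Otherwise go back to the clustered index to fetch the rest of the row.
    row = clustered[pk]
    return {c: row[c] for c in columns}, 1

print(select(["id"], "Jobs"))        # answered from the secondary index alone
print(select(["company"], "Jobs"))   # needs one extra lookup by primary key
```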

  1. Each secondary index is a complete B+ tree, so consider the disk space cost when adding indexes, and avoid adding too many indexes or indexes that are too large (for example, a long varchar column should not be indexed in full).
  2. To keep the keys in order, insert, update, and delete statements may cause page splits, which degrades performance (the more secondary indexes there are, the greater the impact, so building too many secondary indexes is discouraged).
  3. With a composite index on (A, B, C), queries can use the (A), (A, B), and (A, B, C) prefixes of the index.
  4. like '%search_key' cannot use the index; like 'search_key%' can.
  5. Because a secondary index lookup may have to go back to the table, use the primary key index whenever you can.
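Points 3 and 4 both follow from the index being sorted. The sketch below models an index as a sorted Python list and uses binary search to find where a prefix starts; the sample data and the function name `prefix_range` are made up for the illustration, and a real B+ tree is of course not a flat list.

```python
from bisect import bisect_left

# A composite index on (a, b, c) behaves like tuples sorted by a, then b, then c.
index = sorted([(1, "x", 10), (1, "y", 5), (2, "x", 7), (2, "x", 9), (3, "z", 1)])
# A single-column index on a string column is simply a sorted list of strings.
names = sorted(["Bezos", "Ellison", "Gates", "Jobs"])

def prefix_range(entries, prefix):
    """Return the entries that start with prefix, locating the start by binary search."""
    # A prefix sorts just before all of its extensions, so this is the first match.
    start = bisect_left(entries, prefix)
    result = []
    for entry in entries[start:]:
        if entry[:len(prefix)] != prefix:
            break                      # matches are contiguous, so stop at the first miss
        result.append(entry)
    return result

print(prefix_range(index, (2,)))       # uses the (a) prefix
print(prefix_range(index, (2, "x")))   # uses the (a, b) prefix
print(prefix_range(names, "G"))        # like 'G%'

# A condition on b alone, or like '%s', has no contiguous range: every entry is checked.
print([e for e in index if e[1] == "x"])
print([n for n in names if n.endswith("s")])
```

Entries matching a leftmost prefix sit next to each other in sort order, which is why they can be found with one binary search; entries matching only a middle column or a suffix are scattered.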

References


  1. https://blog.jcole.us/innodb/ InnoDB blog series by employee No. 14 of MySQL AB
  2. https://github.com/jeremycole/innodb_ruby innodb_ruby, a tool for analyzing InnoDB's underlying storage
  3. High Performance MySQL
  4. Inside MySQL: The InnoDB Storage Engine (Second Edition)


Origin blog.csdn.net/bruce128/article/details/128066243