Chapter 07_InnoDB data storage structure

1. Database storage structure: page

1.1 Basic unit of disk and memory interaction: page

1.2 Page Structure Overview

1.3 Page size

1.4 Page superstructure

2. Internal structure of the page

Part 1: File Header and File Trailer

1. Database storage structure: page

The index structure gives us an efficient way to look up data, but both the index information and the data records are stored in files, or more precisely, in pages. Indexes are implemented by the storage engine, and the storage engine on the MySQL server is responsible for reading and writing the data in tables. Different storage engines generally use different storage formats, and some engines, such as Memory, do not store data on disk at all.

Since InnoDB is the default storage engine of MySQL, this chapter analyzes the data storage structure of the InnoDB storage engine.

1.1 The basic unit of disk and memory interaction: page

InnoDB divides data into several pages. The default page size in InnoDB is 16KB.

The page is the basic unit of interaction between disk and memory: at least 16KB is read from disk into memory at a time, and at least 16KB is flushed from memory to disk at a time. In other words, **whether the database reads one row or many rows, it loads the whole pages that contain those rows. The page is the basic unit in which the database manages storage space, and it is also the smallest unit of database I/O.** A single page can hold multiple row records.

Records are stored in rows, but database reads are not in row units. Otherwise, one read (that is, one I/O operation) can only process one row of data, and the efficiency will be very low.

1.2 Page Structure Overview

Page a, page b, page c ... page n do not have to be physically contiguous, as long as they are linked together by a doubly linked list. The records in each data page form a singly linked list ordered by primary key value from smallest to largest. Each data page also builds a page directory for the records stored in it. When looking up a record by primary key, a binary search over the page directory quickly locates the corresponding slot, and traversing the records in the group that belongs to that slot then finds the requested record.
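The lookup inside a page can be sketched as follows. This is a minimal illustration, not InnoDB's actual implementation: the names `build_page_directory` and `find_in_page` are hypothetical, records are reduced to bare integer keys, and the group size of 4 is an assumption chosen for readability.

```python
from bisect import bisect_left

def build_page_directory(keys, group_size=4):
    """Split the sorted keys into groups; each slot stores the largest key of its group."""
    groups = [keys[i:i + group_size] for i in range(0, len(keys), group_size)]
    slots = [group[-1] for group in groups]
    return slots, groups

def find_in_page(slots, groups, key):
    """Binary-search the slots, then walk the matching group in key order."""
    i = bisect_left(slots, key)       # first slot whose largest key is >= key
    if i == len(slots):
        return None                   # key is larger than every record in the page
    for k in groups[i]:               # short linear scan, like following the record list
        if k == key:
            return k
        if k > key:
            break                     # records are ordered, so the key is absent
    return None

# Hypothetical page holding the odd keys 1, 3, ..., 19.
slots, groups = build_page_directory(list(range(1, 20, 2)))
print(slots)                            # [7, 15, 19]
print(find_in_page(slots, groups, 7))   # 7
print(find_in_page(slots, groups, 8))   # None
```

The binary search only touches the small slot array, and the linear scan is limited to one group, which is why the lookup stays cheap even when a page holds many records.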

1.3 Page size

Different database management systems (DBMS for short) have different page sizes. For example, in MySQL's InnoDB storage engine the default page size is 16KB, which can be checked with the following command:

show variables like '%innodb_page_size%';
/*
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| innodb_page_size | 16384 |
+------------------+-------+
*/

The page size in SQL Server is 8KB. Oracle uses the term "block" instead of "page", and the block sizes Oracle supports are 2KB, 4KB, 8KB, 16KB, 32KB, and 64KB.

1.4 Page Superstructure

In addition, in the database, there are also concepts of extent, segment and tablespace. The relationship between rows, pages, extents, segments, and table spaces is as shown in the figure below:

[Figure: relationship between rows, pages, extents, segments, and tablespaces]

An extent is the storage structure one level above a page. In the InnoDB storage engine, an extent consists of 64 contiguous pages. Because the default InnoDB page size is 16KB, the size of an extent is 64 * 16KB = 1MB.
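The extent size can be checked with a few lines of arithmetic. The two constants come from the text; the variable names are only illustrative.

```python
PAGE_SIZE_KB = 16        # default InnoDB page size, from the text
PAGES_PER_EXTENT = 64    # contiguous pages per extent, from the text

extent_size_kb = PAGES_PER_EXTENT * PAGE_SIZE_KB
extent_size_mb = extent_size_kb / 1024

print(extent_size_kb, "KB =", extent_size_mb, "MB")   # 1024 KB = 1.0 MB
```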

A segment consists of one or more extents. An extent is a contiguously allocated space in the file system (64 contiguous pages in InnoDB), but the extents within a segment do not have to be adjacent to each other. The segment is the unit of allocation in the database, and different types of database objects exist as different kinds of segments. When a table or an index is created, the corresponding segment is created as well: creating a table creates a table segment, and creating an index creates an index segment.

A tablespace is a logical container that stores segments. A tablespace can contain one or more segments, but a segment can belong to only one tablespace. A database consists of one or more tablespaces. From a management perspective, tablespaces can be divided into system tablespaces, user tablespaces, undo tablespaces, temporary tablespaces, and so on.

2. Internal structure of the page

Classified by type, common pages include data pages (which store B+ tree nodes), system pages, undo pages, and transaction data pages. The data page is the one we use most often.
The 16KB of storage in a data page is divided into seven parts: File Header, Page Header, Infimum + Supremum (the minimum and maximum records), User Records, Free Space, Page Directory, and File Trailer.

A schematic diagram of the page structure is shown below:

[Figure: schematic diagram of the page structure]

The functions of these seven parts are summarized in the following table:

| Name | Size | Description |
| --- | --- | --- |
| File Header | 38 bytes | File header; general information describing the page |
| Page Header | 56 bytes | Page header; status information about the page |
| Infimum + Supremum | 26 bytes | Minimum and maximum records; two virtual row records |
| User Records | variable | User records; the contents of the stored rows |
| Free Space | variable | Free space; the part of the page not yet used |
| Page Directory | variable | Page directory; the relative positions of user records |
| File Trailer | 8 bytes | File trailer; used to verify that the page is complete |
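As a quick check of the sizes listed above, the fixed-size parts can be added up to see how much of a 16KB page is left for the variable-size parts. This is only arithmetic on the figures in the table; the dictionary name is illustrative.

```python
PAGE_SIZE = 16 * 1024   # 16KB data page, in bytes

# Fixed-size parts of a data page, in bytes, as listed in the table.
FIXED_PARTS = {
    "File Header": 38,
    "Page Header": 56,
    "Infimum + Supremum": 26,
    "File Trailer": 8,
}

fixed_bytes = sum(FIXED_PARTS.values())
# What remains is shared by User Records, Free Space, and Page Directory.
variable_bytes = PAGE_SIZE - fixed_bytes

print(fixed_bytes, variable_bytes)   # 128 16256
```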

We can divide these seven structures into three parts.

Part 1: File Header and File Trailer

The first part is the generic file portion, that is, the File Header and the File Trailer.

① File header information

2.3 B+ tree queries from the perspective of data pages

The nodes of a B+ tree fall into two types:
1. Leaf nodes: the nodes at the bottom level of the B+ tree. Their height is 0, and they store the row records.
2. Non-leaf nodes: nodes whose height is greater than 0. They store index keys and page pointers, not the row records themselves.

Understanding the B+ tree in terms of page structure helps explain some principles of index-based retrieval:

1. How does a B+ tree retrieve a record?

To look up a row through a B+ tree index, start at the root of the B+ tree and search level by level until reaching a leaf node, that is, the data page that should contain the row. That data page is loaded into memory. Within the page, a binary search over the slots of the page directory first locates the approximate group of records, and the record is then found by traversing the linked list of records within that group.
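The level-by-level descent can be sketched as follows. This is a simplified model under stated assumptions, not InnoDB's on-disk format: `Node` and `search` are hypothetical names, each node stands for one page, non-leaf nodes hold separator keys, and the lookup inside the leaf is reduced to a binary search over the keys.

```python
from bisect import bisect_left, bisect_right

class Node:
    """One page of the tree: a leaf holds rows, a non-leaf holds child pointers."""
    def __init__(self, keys, children=None, rows=None):
        self.keys = keys            # sorted keys stored in this page
        self.children = children    # child pages (non-leaf nodes only)
        self.rows = rows            # row payloads (leaf nodes only)

def search(root, key):
    """Return (row, pages_loaded); row is None when the key is absent."""
    node, pages_loaded = root, 1
    while node.children is not None:
        # keys[i] is the smallest key reachable through children[i + 1]
        node = node.children[bisect_right(node.keys, key)]
        pages_loaded += 1
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.rows[i], pages_loaded
    return None, pages_loaded

# Hypothetical two-level tree: one root page above three leaf pages.
leaves = [
    Node([1, 5], rows=["row-1", "row-5"]),
    Node([10, 15], rows=["row-10", "row-15"]),
    Node([20, 30], rows=["row-20", "row-30"]),
]
root = Node([10, 20], children=leaves)

print(search(root, 15))   # ('row-15', 2)
print(search(root, 7))    # (None, 2)
```

The number of pages loaded equals the number of levels in the tree, regardless of whether the key is found.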

2. What is the difference in query efficiency between ordinary indexes and unique indexes?

When we create an index, it can be a normal index or a unique index. So what is the difference in query efficiency between these two indexes?

A unique index adds a uniqueness constraint to an ordinary index: the key is unique, so the search stops as soon as the key is found. In an ordinary index, several user records may share the same key, so after the first match the search continues until it meets a record that does not match. Because of the page structure, a record is never read from disk on its own; the whole page containing it is loaded into memory. An InnoDB page is 16KB and may hold thousands of records, so searching on an ordinary index only costs a few extra "check the next record" operations in memory, and for the CPU the time they take is negligible. As a result, for lookups on an indexed field there is essentially no difference in retrieval efficiency between an ordinary index and a unique index.
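The extra work of an ordinary index can be made concrete with a small sketch. It assumes the page is already in memory and models it as a sorted list of keys; `lookup` is a hypothetical helper that counts the in-memory record checks after the first candidate has been located.

```python
from bisect import bisect_left

def lookup(sorted_keys, key, unique):
    """Return (matching positions, number of in-memory record checks)."""
    i = bisect_left(sorted_keys, key)   # locate the first candidate record
    matches, checks = [], 0
    while i < len(sorted_keys):
        checks += 1                     # one "check the next record" operation
        if sorted_keys[i] != key:
            break                       # first non-matching record ends the scan
        matches.append(i)
        if unique:
            break                       # a unique index can stop at the first hit
        i += 1
    return matches, checks

keys = [1, 2, 3, 4]
print(lookup(keys, 2, unique=True))               # ([1], 1)
print(lookup(keys, 2, unique=False))              # ([1], 2)
print(lookup([1, 2, 2, 2, 3], 2, unique=False))   # ([1, 2, 3], 4)
```

On the same data, the ordinary index performs one additional comparison in memory, which is tiny compared with the cost of loading the page.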

3. InnoDB row format (or record format)

We usually insert data into the table in units of rows. The way these records are stored on disk is also called row format or record format.

The InnoDB storage engine defines four row formats: Compact, Redundant, Dynamic, and Compressed.

Check the default row format in MySQL 8 and MySQL 5.7:

mysql> select @@innodb_default_row_format;
+-----------------------------+
| @@innodb_default_row_format |
+-----------------------------+
| dynamic                     |
+-----------------------------+
1 row in set (0.00 sec)

# Query the row format of a single table
mysql> show table status like 'departments' \G
*************************** 1. row ***************************
           Name: departments
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic    # row format
           Rows: 27
 Avg_row_length: 606
    Data_length: 16384
Max_data_length: 0
   Index_length: 49152
      Data_free: 0
 Auto_increment: NULL
    Create_time: 2022-03-23 14:56:38
    Update_time: 2022-03-23 14:56:38
     Check_time: NULL
      Collation: utf8_general_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.01 sec)

4. Extents, segments, and fragment extents

4.1 Why do we need extents?

The pages at each level of a B+ tree form a doubly linked list. If storage were allocated one page at a time, two pages that are adjacent in the doubly linked list could be physically far apart. As mentioned when introducing the use cases of B+ tree indexes, a range query only needs to locate the leftmost and rightmost records and then scan along the doubly linked list. If adjacent pages in the list are physically far apart, the scan turns into random I/O. Disk speed is several orders of magnitude slower than memory speed, and random I/O is very slow, so we should try to keep pages that are adjacent in the linked list physically adjacent as well, so that range queries can use sequential I/O.

This takes advantage of the disk's read-ahead feature.

[See the 4.n extension to understand how MySQL uses the read-ahead feature](#4.n extension)

This is why the concept of the extent was introduced: an extent is 64 pages that are physically contiguous. Because the InnoDB page size defaults to 16KB, the size of an extent is 64 * 16KB = 1MB. When a table holds a large amount of data, space for an index is no longer allocated page by page but extent by extent, and when the table is very large, several contiguous extents can even be allocated at once. This may waste a little space (when the data does not fill a whole extent), but from a performance perspective it eliminates a great deal of random I/O, so the benefit outweighs the cost.

The 64 pages of an extent are physically contiguous, but individual pages are still linked to one another by pointers; the extent only guarantees that a large range of pages is contiguous.

4.2 Why do we need segments?

A range query is really a sequential scan of the records in the leaf nodes of the B+ tree. If leaf and non-leaf nodes were not distinguished and all of their pages were placed in the same allocated extents, range scans would be far less effective. InnoDB therefore treats the leaf nodes and non-leaf nodes of a B+ tree differently: leaf nodes have their own extents, and non-leaf nodes have their own extents. The set of extents that stores leaf nodes counts as one segment, and the set of extents that stores non-leaf nodes counts as another segment. In other words, one index produces two segments: a leaf node segment and a non-leaf node segment.

Besides the leaf node and non-leaf node segments of an index, InnoDB also defines segments for storing certain special data, such as rollback segments. Common segments therefore include data segments, index segments, and rollback segments. The data segment holds the leaf nodes of the B+ tree, and the index segment holds the non-leaf nodes.

In the InnoDB storage engine, the management of segments is completed by the engine itself, and the DBA cannot and does not need to control it. This simplifies the DBA's management of segments to a certain extent.

A segment does not correspond to one contiguous physical region of the tablespace. It is a logical concept made up of some scattered pages and some complete extents.

For the scattered pages, see the discussion of fragment extents.

4.3 Why do we need fragment extents?

By default, a table using the InnoDB storage engine has a single clustered index. One index produces two segments, and segments request storage space in units of extents; each extent occupies 1MB (64 * 16KB = 1024KB) by default. Does that mean a small table holding only a few records needs 2MB of storage by default, and that every additional index costs another 2MB? For a table that stores few records, that would be a huge waste. The crux of the problem is that the extents described so far are pure: a whole extent is allocated to one segment, and every page in the extent exists to store data for that same segment. Even if the segment's data does not fill all the pages in the extent, the remaining pages cannot be used for anything else.

To avoid wasting storage on tables with little data, for which allocating complete extents to a segment is too expensive, InnoDB introduced the concept of the fragment extent. In a fragment extent, the pages do not all store data for the same segment; they can serve different purposes. For example, some pages are used by segment A, some by segment B, and some belong to no segment at all. A fragment extent belongs directly to the tablespace, not to any segment.

So the strategy for allocating storage space for a certain segment is as follows:

  • When data is first inserted into the table, the segment is allocated storage one page at a time from a fragment extent.
  • Once a segment has occupied 32 pages from fragment extents, it requests storage in units of complete extents.

So a segment can no longer be defined simply as a collection of extents. More precisely, it is a collection of some scattered pages and some complete extents.
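The allocation strategy above can be modeled with a small sketch. The `Segment` class and its fields are hypothetical and track only counters; the thresholds (32 fragment pages, 64 pages per extent) come from the text, and real InnoDB bookkeeping is considerably more involved.

```python
PAGES_PER_EXTENT = 64      # pages in one complete extent, from the text
FRAGMENT_PAGE_LIMIT = 32   # fragment pages a segment takes before switching, from the text

class Segment:
    def __init__(self):
        self.fragment_pages = 0   # single pages taken from fragment extents
        self.full_extents = 0     # complete extents owned by this segment
        self.free_in_extent = 0   # unused pages left in the newest extent

    def allocate_page(self):
        """Allocate one page and report where it came from."""
        if self.fragment_pages < FRAGMENT_PAGE_LIMIT:
            self.fragment_pages += 1
            return "fragment"
        if self.free_in_extent == 0:
            # The segment now requests space one whole extent at a time.
            self.full_extents += 1
            self.free_in_extent = PAGES_PER_EXTENT
        self.free_in_extent -= 1
        return "extent"

seg = Segment()
sources = [seg.allocate_page() for _ in range(40)]
print(sources.count("fragment"), sources.count("extent"))   # 32 8
print(seg.full_extents, seg.free_in_extent)                 # 1 56
```

A small table therefore never pays for a full 1MB extent, while a growing segment quickly switches to contiguous extents.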

4.4 Types of extents

Extents can be divided into four types:

  • Free extent (FREE): none of the pages in the extent are used yet.
  • Fragment extent with free space (FREE_FRAG): the fragment extent still has available pages.
  • Fragment extent with no free space (FULL_FRAG): all pages in the fragment extent are used and none are free.
  • Extent belonging to a segment (FSEG): every index can be divided into a leaf node segment and a non-leaf node segment, and these extents are attached to such a segment.

Extents in the FREE, FREE_FRAG, and FULL_FRAG states are independent and belong directly to the tablespace. Extents in the FSEG state belong to a particular segment.

If the tablespace is compared to an army group, segments are like divisions and extents are like regiments. Regiments normally belong to a division, just as extents in the FSEG state belong to a segment, whereas extents in the FREE, FREE_FRAG, and FULL_FRAG states belong directly to the tablespace, just as independent regiments take orders directly from army headquarters.

4.n extension

So, how can a computer determine whether a piece of data may be used next?

Temporal Locality

Temporal locality: If an information item is being accessed, it is likely to be accessed again in the near future.

// This is understandable, of course the used data may be used again.

Spatial Locality

Spatial locality: Information that will be used in the near future is likely to be spatially close to the information being used now.

// The data next to a certain data address being used is of course also likely to be used, such as an array, collection, etc.

Sequential Locality

Sequential locality: In a typical program, most instructions execute sequentially, apart from branch instructions. The ratio of sequential to non-sequential execution is roughly 5:1. In addition, accesses to large arrays are sequential.

The sequential execution of instructions, the continuous storage of arrays, etc. are the causes of sequential locality.

// Most of the instructions being executed and the instructions still queued for processing are executed in sequence.

Disk read-ahead principle

The read and write speed of memory is much faster than that of disk, but memory capacity is much smaller, and data and programs can only be processed after they have been loaded into memory. Memory and disk must therefore perform I/O frequently, and I/O is time-consuming. Although modern systems are supported by channel (I/O processor) technology, this is far from enough, because the CPU is much faster than disk I/O.

Therefore, when the disk is read, nearby data will be loaded into the cache.

Disk reading (details)

Disk access involves mechanical operation. A disk consists of coaxial circular platters of the same size that rotate together. On one side of the disk there is a head arm assembly holding a set of heads, and each head is responsible for accessing one platter. The heads do not rotate; the platters do, while the arm moves the heads in and out to read data on different tracks. A track is one of a series of concentric rings centered on the platter. Each track is divided into small segments called sectors, which are the smallest storage unit of the disk.

When the disk is read, the system passes the logical address of the data to the disk, and the disk's control circuitry translates it into a physical address (which track, which sector). The head must then move to the corresponding track, which is called seeking, and the time this takes is the seek time. The platter then rotates to bring the target sector under the head, and the time this takes is the rotation time. This series of operations is very time-consuming.

Key points

To minimize I/O operations, computer systems generally use read-ahead, and the read-ahead length is generally an integer multiple of the page size. A page is a logical block in which the computer manages memory. Hardware and operating systems usually divide main memory and disk storage into consecutive blocks of equal size, each of which is called a page (in many operating systems the page size is usually 4KB), and main memory and disk exchange data in units of pages. When the data a program needs is not in main memory, a page fault is triggered. The system then sends a read request to the disk; the disk finds the starting position of the data and reads one or more consecutive pages from there into memory. The fault handler then returns and the program continues to run.

Computer systems read and store data in pages. A page is typically 4KB (8 sectors of 512B each, 8 * 512B = 4KB). The smallest unit of each read is one page, and **disk read-ahead usually reads an integer multiple of the page size**. This follows from the locality principle described above: ① when a piece of data is used, nearby data is usually used soon afterwards; ② the data a program needs while running is usually concentrated. Because sequential disk reads are very efficient (no seek time and very little rotation time), the disk reads a whole page of data even when only one byte is needed.
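The page-granularity rule can be illustrated with a short calculation. It assumes the common 512-byte sector size, which gives 8 sectors per 4KB page, and `pages_to_read` is a hypothetical helper that counts the whole pages covering a byte range.

```python
SECTOR_SIZE = 512        # bytes per sector (assumed common value)
PAGE_SIZE = 4 * 1024     # bytes per OS page, the typical size named in the text

sectors_per_page = PAGE_SIZE // SECTOR_SIZE

def pages_to_read(offset, length, page_size=PAGE_SIZE):
    """Number of whole pages that cover the bytes [offset, offset + length)."""
    first_page = offset // page_size
    last_page = (offset + length - 1) // page_size
    return last_page - first_page + 1

print(sectors_per_page)          # 8
print(pages_to_read(0, 1))       # 1: even a single byte costs a whole page
print(pages_to_read(4095, 2))    # 2: the two bytes straddle a page boundary
```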

For disk paging, refer to paged and segmented storage management in operating systems: logical and physical addresses are divided into units of the same size, called pages in the logical address space and blocks in the physical address space.

Why use B-Tree/B+Tree

Data structures derived from the binary search tree, such as the red-black tree, can also be used to implement indexes, but file systems and database systems generally use B-Tree/B+Tree as the index structure.

Generally speaking, the index itself is also very large and cannot be held entirely in memory, so it is usually stored on disk as an index file. Searching the index therefore incurs disk I/O, and compared with memory access, I/O access is several orders of magnitude more expensive. The most important measure of how good a data structure is as an index is therefore the asymptotic complexity of the number of disk I/O operations during a search. In other words, the index should be organized so as to minimize the number of disk I/O accesses during a search.

Consider the maximum number of nodes that must be visited in one B-Tree/B+Tree lookup, which is bounded by the tree height h.

The database system cleverly exploits the disk read-ahead principle by setting the size of a node equal to one page, so that each node can be fully loaded with a single I/O. To achieve this, the following technique is used when implementing a B-Tree:

  Each time a new node is created, a full page of space is requested directly. This guarantees that a node is physically stored in a single page, and because storage allocation is page-aligned, reading one node requires only one I/O.

A lookup in a B-Tree requires at most h-1 I/Os (the root node stays resident in memory), and the asymptotic complexity is O(h) = O(log_m N). In typical applications m is a very large number, usually more than 100, so h is very small (usually no more than 3).

To sum up, using B-Tree as an index structure is very efficient.

For structures such as the red-black tree, h is clearly much larger. Because logically adjacent nodes (parent and child) may be physically far apart, locality cannot be exploited. The I/O asymptotic complexity of a red-black tree is also O(h), so its efficiency is clearly much worse than that of a B-Tree.
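The difference in height can be estimated with a rough calculation. This is a back-of-the-envelope sketch, assuming a tree with fan-out m and h levels can address about m^h keys; `min_height` is a hypothetical helper, and a fan-out of 2 stands in for a balanced binary tree (a red-black tree can be somewhat taller than that).

```python
def min_height(n_keys, fanout):
    """Smallest number of levels h such that fanout ** h >= n_keys."""
    height, capacity = 0, 1
    while capacity < n_keys:
        capacity *= fanout
        height += 1
    return height

n = 1_000_000
print(min_height(n, 100))   # 3: a B-Tree with fan-out 100
print(min_height(n, 2))     # 20: a balanced binary tree
```

With one I/O per level, the wide tree needs a handful of page reads for a million keys, while the binary structure needs roughly twenty.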

B-Tree and B+Tree

B-Tree: Suppose a lookup needs to visit 4 nodes. The database designer uses the disk read-ahead principle to make the node size equal to one page, so reading one node takes just one I/O operation, and the whole lookup needs at most 3 I/Os (the root node stays resident in memory). The smaller the data records, the more records each node holds, the lower the tree, the fewer the I/O operations, and the higher the retrieval efficiency.

B+Tree: Non-leaf nodes store only keys, which greatly reduces the space each entry takes in a non-leaf node. Each node can then hold more entries, the tree is shorter, and fewer I/O operations are needed, so the B+Tree performs better.

5. Tablespace

The tablespace can be regarded as the highest level of the logical structure of the InnoDB storage engine. All data is stored in tablespaces.

A tablespace is a logical container, and the objects stored in it are segments. A tablespace can contain one or more segments, but a segment can belong to only one tablespace. A database consists of one or more tablespaces. From a management perspective, tablespaces can be divided into the system tablespace (System tablespace), independent tablespaces (File-per-table tablespace), undo tablespaces (Undo Tablespace), and temporary tablespaces (Temporary Tablespace).

5.1 Independent tablespace

With an independent tablespace, each table has its own tablespace, in which its data and index information are stored. A table in an independent tablespace can be migrated between databases.

Space can be reclaimed: DROP TABLE reclaims the tablespace automatically, while in other cases the tablespace does not shrink by itself. For statistics or log tables, after deleting a large amount of data you can run alter table TableName engine=innodb; to reclaim the unused space. For tables that use independent tablespaces, however rows are deleted, tablespace fragmentation does not seriously affect performance, and there is still a way to deal with it.

Independent tablespace structure

An independent tablespace consists of segments, extents, and pages, as explained earlier.

Actual size of the tablespace file
Looking in the data directory, we find that the .ibd file of a newly created table occupies only 96KB, that is, 6 pages (in MySQL 5.7). The tablespace takes very little space at first because the table contains no data. But do not forget that .ibd files are self-extending: as the data in the table grows, the file corresponding to the tablespace gradually grows too.

Check the table space type of InnoDB:

# Check whether independent (file-per-table) tablespaces are enabled
mysql> show variables like 'innodb_file_per_table';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| innodb_file_per_table | ON    |
+-----------------------+-------+
1 row in set, 1 warning (0.00 sec)

In MySQL 8.0 the .ibd file of a new table occupies 7 pages. The reason is that the .ibd file now also stores the table structure, and the separate .frm table structure file has been removed.

5.2 System table space

The structure of the system tablespace is basically the same as that of an independent tablespace. However, because the whole MySQL process has only one system tablespace, it also records some additional pages of system-wide information that independent tablespaces do not have.

InnoDB data dictionary

Whenever we insert a record into a table, MySQL performs the following checks:

First it checks whether the table named in the INSERT statement exists and whether the inserted columns match the columns of the table. If the statement is valid, it also needs to know in which tablespace and on which page the root pages of the table's clustered index and all of its secondary indexes are located, and then inserts the record into the B+ tree of each index. So besides the user data we insert, MySQL also has to store a lot of additional information, for example:

- Which tablespace a table belongs to and how many columns the table has
- The type of each column of the table
- How many indexes the table has, which columns each index covers, and on which page of which tablespace the root page of each index is located
- Which foreign keys the table has, and which columns of which table each foreign key references
- The file system path of the file that corresponds to a tablespace
- ...

The data above is not user data that we insert with INSERT statements; it is additional data that has to be introduced to manage our user data better. This data is called metadata. The InnoDB storage engine defines a series of internal system tables to record this metadata:

| Table name | Description |
| --- | --- |
| SYS_TABLES | Information about all tables in the InnoDB storage engine |
| SYS_COLUMNS | Information about all columns in the InnoDB storage engine |
| SYS_INDEXES | Information about all indexes in the InnoDB storage engine |
| SYS_FIELDS | Information about the columns of all indexes in the InnoDB storage engine |
| SYS_FOREIGN | Information about all foreign keys in the InnoDB storage engine |
| SYS_FOREIGN_COLS | Information about the columns of all foreign keys in the InnoDB storage engine |
| SYS_TABLESPACES | Information about all tablespaces in the InnoDB storage engine |
| SYS_DATAFILES | File system paths of the files for all tablespaces in the InnoDB storage engine |
| SYS_VIRTUAL | Information about all virtual generated columns in the InnoDB storage engine |

These system tables are also called the data dictionary. They are stored as B+ trees in certain pages of the system tablespace. Among them, the four tables SYS_TABLES, SYS_COLUMNS, SYS_INDEXES, and SYS_FIELDS are especially important and are called the basic system tables.

Note: Users cannot access these internal InnoDB system tables directly, unless they parse the system tablespace's files on the file system themselves. However, because viewing the contents of these tables can help with troubleshooting, the system database information_schema provides some tables whose names start with innodb_sys:

mysql> USE information_schema ;
Database changed
mysql> SHOW TABLES LIKE 'innodb_sys%';
+--------------------------------------------+
| Tables_in_information_schema (innodb_sys%) |
+--------------------------------------------+
| INNODB_SYS_DATAFILES                       |
| INNODB_SYS_VIRTUAL                         |
| INNODB_SYS_INDEXES                         |
| INNODB_SYS_TABLES                          |
| INNODB_SYS_FIELDS                          |
| INNODB_SYS_TABLESPACES                     |
| INNODB_SYS_FOREIGN_COLS                    |
| INNODB_SYS_COLUMNS                         |
| INNODB_SYS_FOREIGN                         |
| INNODB_SYS_TABLESTATS                      |
+--------------------------------------------+
10 rows in set (0.00 sec)

The tables in the information_schema database whose names start with INNODB_SYS are not the real internal system tables (those are the tables starting with SYS listed above). Instead, when the storage engine starts, it reads the SYS tables and fills their contents into the tables starting with INNODB_SYS. The fields of the INNODB_SYS tables and the SYS tables are not exactly the same, but they are sufficient for reference.

Appendix: Three ways to load data pages

The smallest unit InnoDB reads from disk is the data page. The row with id = xxx that you want is just one of the many rows in that data page.

Logically, we call the data stored in MySQL a table. Physically, on disk, it is stored as data pages, and once a data page is loaded into MySQL we call it a cache page.

If there is no data for this page in the buffer pool, the buffer pool has the following three ways to read data, and the reading efficiency of each method is different:

1. Memory reading

If the data exists in memory, the execution time is basically about 1ms, and the efficiency is still very high.

2. Random reading

If the data is not in memory, the page must be looked up on disk. The overall time is estimated at around 10ms: 6ms is the disk's actual busy time (seek time plus half a rotation), 3ms is the estimated queuing time, and 1ms is the transfer time needed to move the page from the disk server's buffer to the database buffer. 10ms may look fast, but for a database it is a very long time, because it is the time to read just one page.

3. Sequential reading

Sequential reading is really a batch read. Because the data we request is often stored contiguously on disk, sequential reading lets us read pages in batches and load them into the buffer pool in one go, so the other pages do not need separate disk I/O operations. If the disk's throughput is 40MB/s, then with 16KB pages it can read 2560 (40MB / 16KB) pages sequentially per second, which works out to about 0.4ms per page. With batch reading, even reading from disk can be more efficient than reading a single page from memory.
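The figures above can be reproduced with a short calculation. The throughput, page size, and 10ms random-read estimate are the numbers given in the text; the variable names are only illustrative.

```python
THROUGHPUT_MB_PER_S = 40   # sequential disk throughput, from the text
PAGE_SIZE_KB = 16          # InnoDB page size, from the text
RANDOM_READ_MS = 10        # estimated time of one random page read, from the text

pages_per_second = THROUGHPUT_MB_PER_S * 1024 // PAGE_SIZE_KB
sequential_ms_per_page = 1000 / pages_per_second
speedup = RANDOM_READ_MS / sequential_ms_per_page

print(pages_per_second)                   # 2560
print(round(sequential_ms_per_page, 2))   # 0.39
print(round(speedup, 1))                  # 25.6
```

Under these assumptions a sequentially read page costs roughly 0.4ms, about 25 times less than a random read of a single page.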

Origin blog.csdn.net/github_36665118/article/details/134139684