MySQL principle - InnoDB engine - row record storage - Off-page column

This article is based on MySQL 8

In the previous two articles, we analyzed two row record storage formats of the MySQL InnoDB engine, Compact and Redundant. Here is a brief summary:

  • Compact format structure:
    • Variable-length field length list: stores the length of each variable-length field whose value is not NULL, in reverse column order
    • NULL value list: for fields that can be NULL, a bitmap identifies which fields are NULL
    • Record header information: fixed 5 bytes, including:
      • Reserved bits: 2 bits, currently unused
      • deleted_flag: 1 bit, identifies whether the record is deleted
      • min_rec_flag: 1 bit, marks the minimum record of a non-leaf node level in the B+ tree
      • n_owned: 4 bits, the number of records owned by the corresponding slot
      • heap_no: 13 bits, the serial number of the record in the heap, which can also be understood as its position in the heap
      • record_type: 3 bits, the record type: an ordinary data record is 000, a node pointer record is 001, the infimum pseudo-record (the first pseudo-record) is 010, the supremum pseudo-record (the last pseudo-record) is 011, and 1xx is reserved
      • next_record pointer: 16 bits, the relative position of the next record in the page
    • Hidden columns:
      • DB_ROW_ID: 6 bytes; this column may not be generated. A user-defined primary key is preferred. If the user does not define a primary key, a unique key whose columns are all NOT NULL is used as the primary key. If the table has no such unique key, a hidden column named DB_ROW_ID is added to the table as the primary key.
      • DB_TRX_ID: 6 bytes, the ID of the transaction that last inserted or updated this record. Transaction IDs are assigned from an increasing counter.
      • DB_ROLL_PTR: 7 bytes, the undo log pointer. It points to the undo log record for this row, through which the previous version of the data can be found. If the transaction is rolled back, the original value is read from the undo log and written back to the record.
    • Data columns:
      • bigint: if it is not NULL, it occupies 8 bytes. The first bit is the sign bit and the remaining bits store the number. The range is -2^63 ~ 2^63 - 1 = -9223372036854775808 ~ 9223372036854775807. If it is NULL, it occupies no storage space.
      • double: if it is not NULL, it follows the IEEE 754 double-precision bit layout. If it is NULL, it occupies no storage space.
      • Fixed-length fields: the data is stored directly without length information. If the value is shorter than the defined length, it is padded. For example, the char type is padded with 0x20, which corresponds to a space.
      • varchar: because the variable-length field length list sits at the start of the record, varchar only needs to hold the actual data and needs no padding. But we have not yet considered the case of storing particularly long data.
  • Differences between the Redundant format structure and the Compact format:
    • List of all field lengths: unlike the Compact row format, a Redundant record starts with a list of all field lengths, which records the end offset of every field, including the hidden columns. The offsets are cumulative: if the first field has length a and the second field has length b, the entry for the first field is a and the entry for the second field is a + b. The entries are stored in reverse field order.
    • Record header information: fixed 6 bytes, including:
      • Reserved bits: 2 bits, currently unused
      • deleted_flag: 1 bit, identifies whether the record is deleted
      • min_rec_flag: 1 bit, marks the minimum record of a non-leaf node level in the B+ tree
      • n_owned: 4 bits, the number of records owned by the corresponding slot
      • heap_no: 13 bits, the serial number of the record in the heap, which can also be understood as its position in the heap
      • n_field: 10 bits, the number of columns in this record, ranging from 1 to 1023
      • 1byte_offs_flag: 1 bit, 1 means each entry in the field length list takes 1 byte, 0 means 2 bytes
      • next_record pointer: 16 bits, the relative position of the next record in the page
    • Data columns:
      • CHAR type storage: regardless of whether the field is NULL or how long the value is, char(M) occupies M × (the maximum number of bytes per character in the encoding) bytes. If it is NULL, it is filled with 0x00. If it is not NULL and the value is shorter than that, 0x20 is appended at the end.

The earlier articles did not cover how a field is stored when its value is very long. This article analyzes that in detail.

Let's revisit the page mentioned earlier. Reading each record from disk individually costs one disk seek per record. To reduce the number of seeks, data is read in blocks, so that the data needed by the next request is likely to be in memory already and does not have to be read from disk. Based on this idea, InnoDB divides the data of a table into pages, and these pages are linked by B+ tree indexes. The default page size is 16384 bytes, which is 16 KB (configured by innodb_page_size).

For relatively large fields, such as Text columns, storing the whole value in the clustered index would make the leaf nodes too large. A query would then have to read many pages even when it does not request the Text column, which reduces read efficiency. Therefore, InnoDB tends to store long variable-length fields elsewhere, a design known as off-page columns. Different row formats handle this differently.

Before discussing how the different row formats handle this, let's review why the page size matters. InnoDB is a persistent storage engine, that is, data is stored on disk. Reading, writing and processing data all happen in memory, so the data must first be read from disk into memory. Reading only the single record being processed each time would be too inefficient, which is why InnoDB reads a whole page at a time, as described above. innodb_page_size can only be set when the MySQL instance is initialized. In MySQL 8 the supported values are 4096, 8192, 16384, 32768 and 65536.

Off-page column processing in Redundant

For longer columns in the Redundant row format, only the first 768 bytes are stored in the data row, and the rest of the data is placed on other pages. Let's look at an example. Run the following SQL to create a test table and insert test data:

drop table if exists long_column_test;
CREATE TABLE `long_column_test` (
`large_content` varchar(32768) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=REDUNDANT;

## Length is 768 bytes
insert into long_column_test values (repeat("az", 384));
## Length is 8100 bytes
insert into long_column_test values (repeat("az", 4050));
## Length is 32768 bytes
insert into long_column_test values (repeat("az", 16384));

We use a hex editor to view the table file long_column_test.ibd. The first row is an ordinary record, stored exactly like the Redundant records discussed before, with nothing special about it:

[Image: the first row of long_column_test in the Redundant format]

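List of all field lengths (8 bytes, 4 columns: one data column and three hidden columns): 03 13 (768+7+6+6), 00 13 (7+6+6), 00 0c (6+6), 00 06 (6)
Record header (6 bytes): 00 00 10 08 03 ac
Hidden column DB_ROW_ID (6 bytes): 00 00 00 00 02 22
Hidden column DB_TRX_ID (6 bytes): 00 00 00 00 58 b7
Hidden column DB_ROLL_PTR (7 bytes): 82 00 00 01 0c 01 10
Data column large_content (768 bytes): 61 7a ......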
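
As a cross-check of the record header layout summarized earlier, the following minimal Python sketch unpacks the 6 header bytes from the dump. It assumes the bits are packed most significant first, in the order the fields are listed in the summary. The function name `decode_redundant_header` is made up for this illustration and is not an InnoDB API.

```python
# Field widths of the 6-byte Redundant record header, in the order listed
# in the summary. Packing them most-significant-bit first is an assumption
# of this sketch.
REDUNDANT_HEADER_FIELDS = [
    ("reserved", 2),
    ("deleted_flag", 1),
    ("min_rec_flag", 1),
    ("n_owned", 4),
    ("heap_no", 13),
    ("n_field", 10),
    ("1byte_offs_flag", 1),
    ("next_record", 16),
]


def decode_redundant_header(raw: bytes) -> dict:
    """Split a 6-byte Redundant record header into its bit fields."""
    if len(raw) != 6:
        raise ValueError("a Redundant record header is exactly 6 bytes")
    value = int.from_bytes(raw, "big")
    shift = 48  # total number of bits in the header
    fields = {}
    for name, width in REDUNDANT_HEADER_FIELDS:
        shift -= width
        fields[name] = (value >> shift) & ((1 << width) - 1)
    return fields


header = decode_redundant_header(bytes.fromhex("0000100803ac"))
print(header["n_field"])          # 4: one data column plus three hidden columns
print(header["1byte_offs_flag"])  # 0: each field length entry takes 2 bytes
print(header["next_record"])      # 940, i.e. 0x03ac
```

The decoded n_field of 4 and the 2-byte entry size agree with the 8-byte field length list in the dump.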

For the second row, the data of the large_content column is not stored completely in the row. Part of it is stored in the row, and the rest is stored elsewhere. Such a column is called an off-page column, and the pages that hold the rest of the data are called overflow pages. They are structured as follows:
[Image: an off-page column and its overflow page in the Redundant format]

First, the row record itself:

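List of all field lengths (8 bytes, 4 columns: one data column and three hidden columns): 43 27 (the top two bits of the first byte are not part of the length: the highest bit marks whether the field is NULL, and the second bit marks whether the field's data is entirely on the same page. The field is not NULL, so the highest bit is 0. An overflow page exists, so the data is not all on the same page and the second bit is 1. The remaining bits, 3 27, are the length, i.e. 20+768+7+6+6), 00 13 (7+6+6), 00 0c (6+6), 00 06 (6)
Record header (6 bytes): 00 00 10 08 03 ac
Hidden column DB_ROW_ID (6 bytes): 00 00 00 00 02 22
Hidden column DB_TRX_ID (6 bytes): 00 00 00 00 58 b7
Hidden column DB_ROLL_PTR (7 bytes): 82 00 00 01 0c 01 10
Data column large_content (768 bytes): 61 7a ......
Pointer to the location of the remaining data (20 bytes): 00 00 05 23 00 00 00 05 00 00 00 01 00 00 00 00 00 00 1c a4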
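
The field length list of this row can be decoded with a short Python sketch. It follows the rules described above: 2-byte entries stored in reverse field order, cumulative end offsets, the highest bit as the NULL flag and the second bit as the off-page flag. The function name `decode_redundant_offsets` is made up for this illustration.

```python
NULL_FLAG = 0x8000      # highest bit: the field is NULL
OFF_PAGE_FLAG = 0x4000  # second bit: part of the field lives on an overflow page
OFFSET_MASK = 0x3FFF    # remaining 14 bits: cumulative end offset


def decode_redundant_offsets(raw: bytes) -> list:
    """Decode a Redundant field length list made of 2-byte entries."""
    if len(raw) % 2 != 0:
        raise ValueError("expected a whole number of 2-byte entries")
    entries = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    entries.reverse()  # the list is stored in reverse field order
    fields = []
    previous_end = 0
    for entry in entries:
        end = entry & OFFSET_MASK
        fields.append({
            "length": end - previous_end,
            "is_null": bool(entry & NULL_FLAG),
            "off_page": bool(entry & OFF_PAGE_FLAG),
        })
        previous_end = end
    return fields


# Second row of long_column_test: 43 27, 00 13, 00 0c, 00 06
fields = decode_redundant_offsets(bytes.fromhex("4327 0013 000c 0006"))
for field in fields:
    print(field)
# The lengths come out as 6, 6, 7 and 788 (768 bytes of prefix + 20-byte pointer),
# and only the last field has off_page set.
```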

For an off-page column, a pointer to the location of the remaining data follows the part of the column data stored in the row. This pointer occupies 20 bytes. Its structure is:

[Image: structure of the 20-byte pointer]
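
The 20 pointer bytes from the dump can be split apart with a short Python sketch. The field layout used here (4-byte space ID, 4-byte page number, 4-byte offset or version field, 8-byte length field whose top byte holds flags) is an assumption based on InnoDB's external field reference layout and is not stated in this article. The names `decode_overflow_pointer` and `offset_or_version` are made up for the illustration.

```python
import struct


def decode_overflow_pointer(raw: bytes) -> dict:
    """Split a 20-byte off-page pointer into its assumed fields."""
    if len(raw) != 20:
        raise ValueError("the off-page pointer is exactly 20 bytes")
    # Big-endian: three 4-byte unsigned integers followed by one 8-byte integer.
    space_id, page_no, offset_or_version, length_field = struct.unpack(">IIIQ", raw)
    return {
        "space_id": space_id,
        "page_no": page_no,
        "offset_or_version": offset_or_version,
        "flags": length_field >> 56,           # assumed: top byte carries flags
        "length": length_field & 0xFFFFFFFF,   # assumed: low 4 bytes carry the length
    }


pointer = decode_overflow_pointer(
    bytes.fromhex("00000523 00000005 00000001 0000000000001ca4")
)
print(pointer["page_no"])  # 5
print(pointer["length"])   # 7332 = 8100 - 768, the bytes not stored in the row
```

Under this assumed layout, the decoded length matches the 7332 remaining bytes of the second row.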

Then the overflow page, which stores the rest of the column data:

Data column large_content (the remaining 7332 bytes): 61 7a ......

What happens when the field is even longer and the remaining data exceeds what one page can hold? Let's look at the structure of the third row:

[Image: storage structure of the third row]

As we can see, a data column that is too long is spread over several overflow pages, which are linked together as a linked list.

So in the Redundant row format, the off-page structure is actually:
[Image: off-page column structure in the Redundant row format]

This brings us to three questions:

  1. When does a column become an off-page column?
  2. When will the overflow page be divided into linked list nodes for storage?
  3. For which column types will this be stored?

1. When does a column become an off-page column?

First of all, the InnoDB page size is 16 KB (16384 bytes) by default, and InnoDB loads data page by page. InnoDB organizes its data in a B+ tree, and scanning the B+ tree to find data also loads and searches page by page. The more rows a page holds, the more efficient the search. If a page held only one row, the B+ tree would be almost as inefficient as a linked list. Therefore, for efficiency, a page must hold at least two rows. So we have:

$$2 \times \text{row data size} < 16384 \;\Rightarrow\; \text{row data size} < 8192$$

At the same time, a row consists of more than column data: it also has hidden columns, a record header, a field length list and so on, and an InnoDB page has some metadata of its own (132 bytes, which later chapters will analyze in detail). Taking long_column_test as an example, we have:

$$\text{page metadata size} + 2 \times \text{row data size of long\_column\_test} < 16384$$

$$\Rightarrow\; 132 + 2 \times (\text{field length list length} + \text{record header length} + \text{length of the three hidden columns} + \text{large\_content length}) < 16384$$

It can be deduced that:

$$\text{large\_content length} < 8093$$
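
The arithmetic behind this bound can be reproduced with a small Python sketch. It uses only numbers quoted in this article: a 16384-byte page, 132 bytes of page metadata, at least two rows per page, and for long_column_test in the Redundant format an 8-byte field length list, a 6-byte record header and 6+6+7 bytes of hidden columns. The function name `inline_length_bound` is made up for the illustration.

```python
PAGE_SIZE = 16384       # default innodb_page_size in bytes
PAGE_METADATA = 132     # page metadata size quoted in this article
MIN_ROWS_PER_PAGE = 2   # a page must hold at least two rows


def inline_length_bound(row_overhead: int) -> int:
    """Return the exclusive upper bound on the column length that still
    satisfies PAGE_METADATA + 2 * (row_overhead + length) < PAGE_SIZE."""
    budget = PAGE_SIZE - PAGE_METADATA - MIN_ROWS_PER_PAGE * row_overhead
    # Ceiling division: the smallest integer the length has to stay below.
    return -(-budget // MIN_ROWS_PER_PAGE)


# Redundant overhead: field length list + record header + hidden columns
REDUNDANT_OVERHEAD = 8 + 6 + (6 + 6 + 7)
bound = inline_length_bound(REDUNDANT_OVERHEAD)
print(bound)         # 8093
print(768 < bound)   # True: the first test row stays in the row
print(8100 < bound)  # False: the second test row becomes off-page
```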

In practice, more than one column may be long. Also, because off-page data is not stored together with the row, searching and reading it is relatively slow. Therefore, the Redundant row format turns as few columns as possible into off-page columns.

2. When will the overflow page be divided into linked list nodes for storage?

An overflow page is different from table data. It is not organized through the B+ tree and needs no complex search. It is simply a node in a linked list. So we only need to ensure that the data does not exceed one page, namely:

$$\text{overflow page data node size} < 16384$$

The data node also carries some additional information, and the page has additional information of its own, which later articles will cover. The payload an overflow page can actually carry is therefore 16384 bytes minus this additional information. If one page is not enough, the data is split across multiple pages, and these nodes are linked together as a linked list.
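
The splitting rule can be sketched in a few lines of Python. The article does not give the exact payload capacity of an overflow page, so `ASSUMED_NODE_CAPACITY` below is a placeholder chosen only for illustration, not InnoDB's real figure. The function name `split_into_overflow_nodes` is also made up.

```python
def split_into_overflow_nodes(remaining_bytes: int, node_capacity: int) -> list:
    """Return the payload size of each linked-list node needed to hold
    remaining_bytes when one node carries at most node_capacity bytes."""
    if node_capacity <= 0:
        raise ValueError("node_capacity must be positive")
    nodes = []
    while remaining_bytes > 0:
        chunk = min(remaining_bytes, node_capacity)
        nodes.append(chunk)
        remaining_bytes -= chunk
    return nodes


# Placeholder capacity: a 16384-byte page minus some unspecified overhead.
ASSUMED_NODE_CAPACITY = 16000

# Second test row: 8100 bytes, 768 of which stay in the row.
print(split_into_overflow_nodes(8100 - 768, ASSUMED_NODE_CAPACITY))   # [7332]
# Third test row: 32768 bytes, 768 of which stay in the row.
print(split_into_overflow_nodes(32768 - 768, ASSUMED_NODE_CAPACITY))  # [16000, 16000]
```

With any capacity below 16384 bytes, the third row needs more than one node, which is why its overflow data forms a chain.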

3. For which column types will this be stored?

Variable-length fields, such as varchar, varbinary, text and blob, are stored with this mechanism. A fixed-length field such as char is also stored like varchar if it is too long, and in that case the end of the char value is not padded with blank characters. But this situation is uncommon: char can hold at most 255 characters, so the value exceeds 768 bytes only when the character encoding needs more than 3 bytes per character, such as utf8mb4 with 4-byte characters (255 × 3 = 765 bytes, while 255 × 4 = 1020 bytes).

Off-page column processing in Compact

Compact handles off-page columns in basically the same way as Redundant. Only the record layout differs:
[Image: off-page column structure in the Compact row format]

The threshold at which a column becomes an off-page column is also different. Taking long_column_test as an example again, we have:

$$\text{page metadata size} + 2 \times \text{row data size of long\_column\_test} < 16384$$

$$\Rightarrow\; 132 + 2 \times (\text{variable-length list (2 bytes)} + \text{NULL value list (1 byte)} + \text{record header (5 bytes)} + \text{three hidden columns (6+6+7 bytes)} + \text{large\_content length}) < 16384$$

It can be deduced that:

$$\text{large\_content length} < 8099$$
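
As a quick check of this result, the following Python sketch recomputes the Compact bound from the per-row overhead listed in the inequality and then classifies the three test rows against it. The names `COMPACT_OVERHEAD`, `compact_inline_bound` and `becomes_off_page` are made up for the illustration, and the classification applies only the inequality from this article, not any other rule InnoDB may use.

```python
# Per-row overhead of long_column_test in the Compact format, in bytes,
# taken from the inequality in this article.
COMPACT_OVERHEAD = {
    "variable_length_list": 2,
    "null_value_list": 1,
    "record_header": 5,
    "hidden_columns": 6 + 6 + 7,
}


def compact_inline_bound(page_size: int = 16384, page_metadata: int = 132) -> int:
    """Exclusive upper bound on the large_content length that keeps
    two rows within one page."""
    per_row_budget = (page_size - page_metadata) // 2
    return per_row_budget - sum(COMPACT_OVERHEAD.values())


def becomes_off_page(length: int) -> bool:
    """True if a value of this length fails the two-rows-per-page inequality."""
    return length >= compact_inline_bound()


print(compact_inline_bound())  # 8099
for length in (768, 8100, 32768):
    print(length, becomes_off_page(length))
# 768 False
# 8100 True
# 32768 True
```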

Off-page column processing in Dynamic

Dynamic is basically the same as Compact. Only the handling of off-page columns differs.

The main difference is that Dynamic stores all of an off-page column's data on overflow pages and keeps only a 20-byte pointer in the row. The structure of this pointer is the same as the 20-byte pointer in the Redundant format:
[Image: off-page column storage in the Dynamic row format]
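
The difference between the row formats can be summarized in a small Python sketch that returns how many bytes of an off-page column stay in the row. It only encodes what this article states: Redundant and Compact keep a 768-byte prefix plus the 20-byte pointer, while Dynamic keeps only the pointer, and Compressed is treated like Dynamic. The function name `in_row_bytes` is made up for the illustration.

```python
POINTER_SIZE = 20   # size of the off-page pointer in bytes
PREFIX_SIZE = 768   # prefix kept in the row by Redundant and Compact


def in_row_bytes(row_format: str, value_length: int, off_page: bool) -> int:
    """Bytes of one column value that are stored in the row itself."""
    if not off_page:
        return value_length  # the whole value lives in the row
    fmt = row_format.upper()
    if fmt in ("REDUNDANT", "COMPACT"):
        return PREFIX_SIZE + POINTER_SIZE
    if fmt in ("DYNAMIC", "COMPRESSED"):
        return POINTER_SIZE
    raise ValueError(f"unknown row format: {row_format}")


for fmt in ("REDUNDANT", "COMPACT", "DYNAMIC", "COMPRESSED"):
    print(fmt, in_row_bytes(fmt, 8100, off_page=True))
# REDUNDANT 788
# COMPACT 788
# DYNAMIC 20
# COMPRESSED 20
```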

Off-page column processing in Compressed

The Compressed row format is basically the same as Dynamic, including its handling of off-page columns. It simply adds compression on top of Dynamic. Compression will be analyzed in detail in the later chapter on compressed pages.

Origin blog.csdn.net/qq_41701956/article/details/118541792