A detailed look at the MySQL storage engine InnoDB: understanding its data structures from the bottom up

InnoDB is a transaction-safe storage engine and the default storage engine for MySQL. This article explains how InnoDB's row formats and data pages are implemented, looking at the storage engine from the bottom up through its data structures.

Introduction to InnoDB

Data in MySQL is stored on disk, but the actual processing happens in memory. Disk reads and writes are slow, so going to the disk for every operation would perform very poorly. To address this, InnoDB divides data into pages and uses the page as the basic unit of interaction between disk and memory; the page size is generally 16KB. At least one whole page is read into memory or written to disk at a time, which improves performance by reducing the number of interactions between memory and disk.

This is essentially a typical cache design. Cache designs generally exploit either the time dimension or the space dimension:

Time dimension: If a piece of data is being used, there is a high probability that it will be used again soon. Hot-data caching is an application of this idea.
Spatial dimension: If a piece of data is being used, there is a high probability that the data stored near it will be used soon. InnoDB's data pages and the operating system's page cache are applications of this idea.
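
To make page-granular access concrete, here is a minimal Python sketch of which pages a byte range touches. It assumes the 16KB page size mentioned above; the function name pages_touched is made up for illustration and is not an InnoDB API.

```python
PAGE_SIZE = 16 * 1024  # 16KB, the page size used in this article


def pages_touched(offset: int, length: int) -> range:
    """Return the page numbers covering the bytes [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // PAGE_SIZE
    last = (offset + length - 1) // PAGE_SIZE
    return range(first, last + 1)


# Even a 50-byte read costs one whole page.
print(list(pages_touched(100, 50)))    # [0]
# A 10-byte read that straddles a page boundary costs two pages.
print(list(pages_touched(16380, 10)))  # [0, 1]
```

Because neighbouring bytes come along for free with every page read, data stored near a requested row is already in memory when it is needed next.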

InnoDB row format

MySQL inserts data into a table in units of records (rows). The way these records are stored on disk is called the row format. InnoDB supports 4 row formats: Compact, Redundant (older; not covered in this article), Dynamic, and Compressed.

We can specify the row format in the statement that creates or alters a table:

CREATE TABLE table_name (column_definitions) ROW_FORMAT=row_format_name
ALTER TABLE table_name ROW_FORMAT=row_format_name

For example, to create a table record_format_demo whose row format is Compact and whose character set is ascii, the SQL is as follows:

mysql> CREATE TABLE record_format_demo (
-> c1 VARCHAR(10),
-> c2 VARCHAR(10) NOT NULL,
-> c3 CHAR(10),
-> c4 VARCHAR(10)
->) CHARSET=ascii ROW_FORMAT=COMPACT;
Query OK, 0 rows affected (0.03 sec)
Suppose we inserted 2 rows of data into the record_format_demo table:

mysql> SELECT * FROM record_format_demo;
+------+-----+------+------+
| c1   | c2  | c3   | c4   |
+------+-----+------+------+
| aaaa | bbb | cc   | d    |
| eeee | fff | NULL | NULL |
+------+-----+------+------+
2 rows in set (0.00 sec)

COMPACT row format

[Figure: structure of a record in the Compact row format]

As the figure above shows, a complete record consists of two parts: the record's extra information and the record's real data.

Extra information of a record

The extra information of a record consists of three parts: the variable-length field length list, the NULL value list, and the record header information.

Variable-length field length list
MySQL supports several variable-length data types (such as VARCHAR(M) and TEXT). The storage space such a column occupies is not fixed; it varies with the stored content. To describe this data accurately, the storage for a variable-length field must include two things:

The actual data content
The number of bytes that content occupies

In the Compact row format, the byte lengths of the real data of all variable-length fields are stored at the beginning of the record, forming the variable-length field length list. The lengths are stored in the reverse order of the columns.

Take the first row of record_format_demo as an example. Since c1, c2, and c4 are all variable-length types (VARCHAR(10)), the lengths of these three columns are stored at the beginning of the record.
[Figure: variable-length field length list of the first record]

Another point to note is that the variable-length field length list only stores the length occupied by the content of the column whose value is non-NULL, and the length of the column whose value is NULL is not stored. That is to say, for the second record, because the value of the c4 column is NULL, the variable-length field length list of the second record only needs to store the length of the c1 and c2 columns.
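
The rules above can be sketched in Python for the two rows of record_format_demo. This is an illustration only: it assumes each length fits in one byte (true for short ascii values, while InnoDB may use two bytes for longer columns), and the helper name varlen_length_list is hypothetical.

```python
def varlen_length_list(columns, row, charset="ascii"):
    """Build the variable-length field length list for one row.

    columns: list of (name, is_variable_length) in table order.
    row: dict mapping column name to a str, or None for NULL.
    Assumption: every length fits in a single byte.
    """
    lengths = []
    for name, is_varlen in columns:
        value = row[name]
        # Only non-NULL variable-length columns contribute a length.
        if is_varlen and value is not None:
            lengths.append(len(value.encode(charset)))
    # Lengths are stored in reverse column order.
    return bytes(reversed(lengths))


columns = [("c1", True), ("c2", True), ("c3", False), ("c4", True)]
row1 = {"c1": "aaaa", "c2": "bbb", "c3": "cc", "c4": "d"}
row2 = {"c1": "eeee", "c2": "fff", "c3": None, "c4": None}

print(varlen_length_list(columns, row1).hex())  # 010304 (c4, c2, c1)
print(varlen_length_list(columns, row2).hex())  # 0304   (c2, c1; c4 is NULL)
```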

NULL value list

For columns that may be NULL, MySQL does not store NULL values in the real data part of the record, in order to save space. Instead, they are recorded in the NULL value list in the record's extra information.

Specifically, InnoDB first determines which columns in the table are allowed to store NULL, then maps each such column to one binary bit (1: the value is NULL, 0: the value is not NULL), arranged in reverse column order. The NULL value list must occupy a whole number of bytes; if the number of bits used is not a multiple of 8, the high bits of the byte are padded with 0.

In the record_format_demo table, c1, c3, and c4 are all allowed to store NULL. After filling in the NULL value list, the two records look like this:
[Figure: NULL value lists of the two records]
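
As a rough illustration of the bit mapping described above, the following Python sketch computes the NULL value list for the two rows. The helper null_value_list is hypothetical; it simply places the first nullable column in the lowest bit, which is what "reverse order" amounts to when the byte is written out.

```python
def null_value_list(nullable_columns, row):
    """Build the NULL value list for one row.

    nullable_columns: names of columns that allow NULL, in table order.
    row: dict mapping column name to a value, or None for NULL.
    """
    bits = 0
    for position, name in enumerate(nullable_columns):
        if row[name] is None:
            # The first nullable column maps to the lowest bit, so the
            # columns appear in reverse order when read left to right.
            bits |= 1 << position
    # Round up to a whole number of bytes; unused high bits stay 0.
    n_bytes = (len(nullable_columns) + 7) // 8
    return bits.to_bytes(n_bytes, "big")


nullable = ["c1", "c3", "c4"]  # c2 is NOT NULL, so it gets no bit
row1 = {"c1": "aaaa", "c2": "bbb", "c3": "cc", "c4": "d"}
row2 = {"c1": "eeee", "c2": "fff", "c3": None, "c4": None}

print(null_value_list(nullable, row1).hex())  # 00
print(null_value_list(nullable, row2).hex())  # 06 (binary 00000110: c4=1, c3=1, c1=0)
```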

Record header information
The record header information consists of a fixed 5 bytes (40 bits), and different bits have different meanings:
[Figure: bit layout of the record header information]
We will not go into the details for now.

Real data of a record

Besides the data of each user-defined column, InnoDB automatically adds some hidden columns to the real data of a record.
[Table: the hidden columns row_id, transaction_id, and roll_pointer]
The real names of these columns are DB_ROW_ID, DB_TRX_ID, and DB_ROLL_PTR; row_id, transaction_id, and roll_pointer are used here only for readability.

The hidden column row_id exists only when the table defines neither a primary key nor a unique key, in which case it serves as the table's primary key.

Because the table record_format_demo does not define a primary key, InnoDB adds all three columns to each record. With the real data filled in, the two records look like this:
[Figure: the two records with their hidden columns and real data]

CHAR(M) column storage format

For a CHAR(M) column that uses a fixed-length character set, the number of bytes the column occupies is not added to the variable-length field length list. If the column uses a variable-length character set, the number of bytes it occupies is added to the variable-length field length list.

Also note that a CHAR(M) column in a variable-length character set occupies at least M bytes, whereas VARCHAR(M) has no such requirement. For example, a CHAR(10) column using the utf8 character set stores between 10 and 30 bytes; even an empty string takes up 10 bytes.
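
The 10 to 30 byte range can be checked with a small Python sketch. It is an approximation under stated assumptions: Python's "utf-8" codec stands in for MySQL's utf8 (which allows at most 3 bytes per character), and char_storage_bytes is a hypothetical helper, not a MySQL function.

```python
def char_storage_bytes(value: str, m: int, charset: str = "utf-8") -> int:
    """Bytes a CHAR(m) value occupies in a variable-length character set.

    Assumption: the value contains only characters of at most 3 bytes,
    matching MySQL's utf8 character set.
    """
    if len(value) > m:
        raise ValueError("value is longer than CHAR(%d)" % m)
    # The column never occupies fewer than m bytes.
    return max(m, len(value.encode(charset)))


print(char_storage_bytes("", 10))         # 10: even an empty string takes M bytes
print(char_storage_bytes("abc", 10))      # 10
print(char_storage_bytes("中" * 10, 10))  # 30: ten 3-byte characters
```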

Row overflow data

Maximum data that VARCHAR(M) can store
MySQL limits the maximum storage space a record can occupy. Apart from BLOB and TEXT columns, the bytes occupied by all other columns (excluding hidden columns and the record header information) cannot exceed 65535 in total. Loosely speaking, a row in MySQL cannot occupy more than 65535 bytes. Besides the column data itself, these 65535 bytes also include some other data (storage overhead). For example, storing a VARCHAR(M) column actually takes 3 parts of storage space:

The real data
The byte length of the real data
The NULL flag (if the column has the NOT NULL attribute, this part takes no space)

Suppose the table varchar_size_demo has a single VARCHAR column. The length of the real data may take 2 bytes and the NULL flag takes 1 byte, so if the column does not have the NOT NULL attribute it can store at most 65532 bytes of data. With the ascii character set that corresponds to at most 65532 characters; with the utf8 character set, at most 21844 characters.
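
The arithmetic behind these limits can be written out as a short Python sketch. The constants come from the text above; the function names are made up for illustration.

```python
MAX_ROW_BYTES = 65535      # row size limit from the text
LENGTH_PREFIX_BYTES = 2    # bytes used to store the length of the real data


def max_varchar_bytes(nullable: bool) -> int:
    """Largest number of data bytes a single VARCHAR column can hold."""
    null_flag_bytes = 1 if nullable else 0
    return MAX_ROW_BYTES - LENGTH_PREFIX_BYTES - null_flag_bytes


def max_varchar_chars(nullable: bool, max_bytes_per_char: int) -> int:
    """Largest M for VARCHAR(M), given the widest character in the charset."""
    return max_varchar_bytes(nullable) // max_bytes_per_char


print(max_varchar_bytes(nullable=True))      # 65532
print(max_varchar_chars(True, 1))            # 65532 (ascii)
print(max_varchar_chars(True, 3))            # 21844 (utf8, up to 3 bytes per character)
```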

Overflow caused by too much data in the record
Let's take the varchar_size_demo table in the ascii character set as an example, insert a record:

mysql> CREATE TABLE varchar_size_demo(
-> c VARCHAR(65532)
-> ) CHARSET=ascii ROW_FORMAT=Compact;
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO varchar_size_demo(c) VALUES(REPEAT('a', 65532));
Query OK, 1 row affected (0.00 sec)
The basic unit of disk-memory interaction in MySQL is the page, which is generally 16KB (16384 bytes), while a single row can occupy up to 65535 bytes, so one page may not be able to hold even a single row. In the Compact and Redundant row formats, for a column that occupies a very large amount of space, only part of the column's data is stored in the record's real data. The rest is spread across several other pages, and the record's real data uses 20 bytes to store the address of those pages so that the remaining data can be found, as shown in the figure:
[Figure: a record pointing to the pages that hold the rest of the column data]

When the record's real data stores only the first 768 bytes of a column plus an address pointing to other pages, and the remaining data is stored in those other pages, this is called row overflow. The pages that store the data beyond the first 768 bytes are called overflow pages.
[Figure: row overflow and overflow pages]
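
A minimal Python sketch of this split for the Compact format follows. It only models the sizes mentioned in the text (a 768-byte prefix and a 20-byte address) and assumes the column has already been judged to overflow; how the remainder is laid out across overflow pages is not modeled, and split_overflowing_column is a hypothetical name.

```python
LOCAL_PREFIX_BYTES = 768   # part of the column kept in the record
POINTER_BYTES = 20         # address of the overflow pages


def split_overflowing_column(data: bytes):
    """Split a column that has already been judged to overflow.

    Returns (bytes kept in the record, bytes sent to overflow pages,
    space the column occupies inside the record including the address).
    """
    prefix = data[:LOCAL_PREFIX_BYTES]
    remainder = data[LOCAL_PREFIX_BYTES:]
    in_record_size = len(prefix) + POINTER_BYTES
    return prefix, remainder, in_record_size


prefix, remainder, in_record = split_overflowing_column(b"a" * 65532)
print(len(prefix), len(remainder), in_record)  # 768 64764 788
```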

Critical point of row overflow

MySQL requires that a page store at least two rows. Take the varchar_size_demo table above as an example. It has only one column, c, and we insert two records into it. How many bytes must each record store before the column overflows? To answer that, we need to look at how the space in a page is used.

Besides our records, each page also needs to store some additional information, about 132 bytes.
Each record needs 27 bytes of extra information.
Assuming the column stores n bytes of data, the column does not overflow as long as:

132 + 2×(27 + n) < 16384

which gives n < 8099. In other words, if the column stores fewer than 8099 bytes, it will not become an overflow column. If the table has multiple columns, this threshold is smaller.
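
The inequality can be checked numerically with a few lines of Python. The constants are the approximate figures quoted above (132 bytes per page, 27 bytes per record); fits_without_overflow is a hypothetical helper.

```python
PAGE_SIZE = 16384
PAGE_OVERHEAD = 132        # approximate per-page extra information
RECORD_OVERHEAD = 27       # extra information per record
MIN_RECORDS_PER_PAGE = 2   # a page must hold at least two records


def fits_without_overflow(n: int) -> bool:
    """True if two records with n data bytes each fit in one page."""
    used = PAGE_OVERHEAD + MIN_RECORDS_PER_PAGE * (RECORD_OVERHEAD + n)
    return used < PAGE_SIZE


print(fits_without_overflow(8098))  # True
print(fits_without_overflow(8099))  # False: 132 + 2 * (27 + 8099) == 16384
```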

Dynamic and Compressed row formats

Dynamic is the default row format in MySQL. The Dynamic and Compressed row formats are very similar to Compact; they differ in how they handle row overflow. Instead of keeping the first 768 bytes of the column in the record's real data, they store all of the column's bytes in other pages and keep only the address of those pages in the record. The Compressed row format additionally uses a compression algorithm to compress pages and save space.
[Figure: row overflow in the Dynamic and Compressed row formats]

InnoDB data page structure

We already know that the page is the basic unit InnoDB uses to manage storage space, and that a page is generally 16KB. InnoDB defines many page types for different purposes. Here we focus on the pages that store data records, officially called index pages. Since indexes have not been introduced yet, we will call them data pages for now.

Quick view of data page structure

Structurally, a data page is divided into several parts with different functions, as shown in the following figure:
[Figure: structure of an InnoDB data page]

An InnoDB data page is divided into 7 parts, which are roughly described below.
[Table: the 7 parts of a data page: File Header, Page Header, Infimum + Supremum, User Records, Free Space, Page Directory, and File Trailer]

Storage of records in the page

The records we store are placed in User Records according to the row format. A newly created page has no User Records part. Starting with the first insert, each inserted record takes a record-sized piece of space from Free Space and assigns it to User Records. When Free Space is used up, the data page is full.
[Figure: space moving from Free Space to User Records as records are inserted]

To understand User Records, we first need to understand the record header information mentioned earlier.

Understanding the record header information

First, a brief description of each attribute in the record header information:
[Table: attributes of the record header information]
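
As a rough aid, the 5-byte header can be decoded with a few bit operations in Python. The field widths used here are an assumption based on the commonly documented Compact layout (2 reserved bits, delete_mask 1 bit, min_rec_mask 1 bit, n_owned 4 bits, heap_no 13 bits, record_type 3 bits, next_record 16 bits), and reading next_record as a signed offset is also an assumption. parse_record_header is a hypothetical helper, not an InnoDB function.

```python
def parse_record_header(header: bytes) -> dict:
    """Decode a 5-byte (40-bit) record header under the assumed layout."""
    if len(header) != 5:
        raise ValueError("record header must be exactly 5 bytes")
    bits = int.from_bytes(header, "big")
    return {
        "delete_mask": (bits >> 37) & 0x1,
        "min_rec_mask": (bits >> 36) & 0x1,
        "n_owned": (bits >> 32) & 0xF,
        "heap_no": (bits >> 19) & 0x1FFF,
        "record_type": (bits >> 16) & 0x7,
        # Assumption: the offset to the next record is a signed 16-bit value.
        "next_record": int.from_bytes(header[3:5], "big", signed=True),
    }


# A made-up header: heap_no = 2, record_type = 0, next_record = 32.
example = bytes([0x00, 0x00, 0x10, 0x00, 0x20])
print(parse_record_header(example))
```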

Next, take the page_demo table as an example and insert some data to introduce the record header information in detail.

mysql> CREATE TABLE page_demo(
-> c1 INT,
-> c2 INT,
-> c3 VARCHAR(10000),
-> PRIMARY KEY (c1)
-> ) CHARSET=ascii ROW_FORMAT=Compact;
Query OK, 0 rows affected (0.03 sec)

mysql> INSERT INTO page_demo VALUES(1, 100,'aaaa'), (2, 200,'bbbb'), (3, 300,'cccc'), (4, 400,'dddd');
Query OK, 4 rows affected (0.00 sec)
Records: 4 Duplicates: 0 Warnings: 0
The row format of these 4 records in InnoDB is shown below (only the record header and the real data are shown; all values are in decimal):
[Figure: the 4 records of page_demo with their record header and real data]

With this figure as a reference, let's look at several attributes in detail:

delete_mask: Marks whether the current record has been deleted; 0 means not deleted, 1 means deleted. A deleted record is not removed from disk immediately. It is first marked as deleted, and all deleted records form a garbage linked list. Newly inserted records may reuse the space occupied by the garbage linked list, which is why that space is also called reusable space.
heap_no: Indicates the position of the current record in the page. For example, the positions of the 4 records above are 2, 3, 4, and 5. InnoDB automatically adds two virtual records to every page: the smallest record (Infimum) and the largest record (Supremum). Their structure is very simple: a 5-byte record header followed by a fixed 8-byte part whose content is the word infimum or supremum. These two records are stored separately in the Infimum + Supremum part.
[Figure: the Infimum and Supremum records]

As the figure shows, the heap_no values of the smallest and largest records are 0 and 1 respectively, meaning they come before all user records.

next_record: The address offset from the real data of the current record to the real data of the next record. The records can be understood as a singly linked list in which the smallest record (Infimum) is the head and the largest record (Supremum) is the tail. To show this more clearly, we can replace the address offsets in next_record with arrows:
[Figure: records linked through next_record]

The figure also shows that the user records form a singly linked list ordered by primary key. If a record is deleted, the linked list changes accordingly. For example, suppose we delete the second record:
[Figure: the linked list after deleting the second record]

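The delete_mask of the second record is set to 1; the record is not removed from the page.
The next_record of the second record becomes 0, meaning it no longer has a next record.
The next_record of the first record now points to the third record.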
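
The effect of a delete on the record list can be sketched in Python as an ordinary singly linked list with a delete flag. This is a simplified model, not InnoDB's on-page layout: next is an object reference rather than a byte offset, the garbage linked list is not modeled, and the names Record, build_page, delete, and live_keys are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    key: object
    delete_mask: int = 0
    next: Optional["Record"] = None


def build_page(keys):
    """Link Infimum -> user records in primary key order -> Supremum."""
    infimum, supremum = Record("infimum"), Record("supremum")
    prev = infimum
    for key in sorted(keys):
        prev.next = Record(key)
        prev = prev.next
    prev.next = supremum
    return infimum, supremum


def delete(infimum, supremum, key):
    """Mark the record as deleted and unlink it from the list."""
    prev, cur = infimum, infimum.next
    while cur is not supremum:
        if cur.key == key:
            cur.delete_mask = 1    # marked, not physically removed
            prev.next = cur.next   # predecessor skips the deleted record
            cur.next = None        # deleted record has no next record
            return cur
        prev, cur = cur, cur.next
    return None


def live_keys(infimum, supremum):
    """Walk the list from Infimum to Supremum and collect user keys."""
    keys, cur = [], infimum.next
    while cur is not supremum:
        keys.append(cur.key)
        cur = cur.next
    return keys


infimum, supremum = build_page([1, 2, 3, 4])
removed = delete(infimum, supremum, 2)
print(live_keys(infimum, supremum), removed.delete_mask)  # [1, 3, 4] 1
```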

Page Directory

We already know that the records in a page are linked into a singly linked list in ascending primary key order. What if we want to find a specific record by primary key? The simple approach is to traverse the linked list, but with a large amount of data this is too inefficient. InnoDB solves this problem with the Page Directory. Its general principle is as follows:

The records in the page are divided into groups, and each group corresponds to one slot in the Page Directory.
InnoDB stipulates that the group containing the smallest record can have only 1 record, the group containing the largest record can have between 1 and 8 records, and each remaining group can have between 4 and 8 records.

For example, suppose the page_demo table currently has 18 normal records. InnoDB divides them into 5 groups, the first of which contains only the smallest record, as shown below:

[Figure: Page Directory grouping of the records, https://chentianming11.github.io/images/mysql/page directory.webp]

Finding the record with a given primary key value in a data page through the Page Directory takes two steps:

Use binary search to determine the slot the record belongs to, and find the record with the smallest primary key value in that slot's group.
Traverse the records in that group through each record's next_record attribute.

Query optimizations on linked lists are basically all built on the idea of binary search. The Page Directory described above, skip lists, and search trees all follow it.
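
The two steps can be sketched in Python as a binary search over slots followed by a short linear scan. This is a simplified model under stated assumptions: records live in a sorted Python list instead of a linked list, every group has a fixed size of 4, and the special rules for the Infimum and Supremum groups are ignored. build_slots and find are hypothetical names.

```python
def build_slots(sorted_keys, group_size=4):
    """Each slot stores the index of the last (largest) record of a group."""
    slots = list(range(group_size - 1, len(sorted_keys), group_size))
    if sorted_keys and (not slots or slots[-1] != len(sorted_keys) - 1):
        slots.append(len(sorted_keys) - 1)  # last, possibly smaller group
    return slots


def find(sorted_keys, slots, target):
    """Return the index of target in sorted_keys, or None if absent."""
    if not slots:
        return None
    # Step 1: binary search for the first slot whose largest key >= target.
    lo, hi = 0, len(slots) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_keys[slots[mid]] < target:
            lo = mid + 1
        else:
            hi = mid
    # Step 2: scan only the records of that group, starting at its smallest.
    start = slots[lo - 1] + 1 if lo > 0 else 0
    for i in range(start, slots[lo] + 1):
        if sorted_keys[i] == target:
            return i
    return None


keys = list(range(1, 17))        # 16 records with primary keys 1..16
slots = build_slots(keys)
print(slots)                     # [3, 7, 11, 15]
print(find(keys, slots, 10))     # 9
print(find(keys, slots, 17))     # None
```

With groups of bounded size, the linear scan in step 2 touches only a handful of records regardless of how many records the page holds.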

Page Header

Page Header stores status information about the data page, such as how many records the page holds, the address of the first record, and how many slots the Page Directory contains. It occupies a fixed 56 bytes; the meaning of each attribute is as follows:
[Table: attributes of the Page Header]

This is only a list; there is no need to understand every attribute right now.

File Header

File Header describes general information that applies to all page types. It consists of the following:
[Table: attributes of the File Header]

Again, there is no need to understand every attribute right now. We focus on a few of them:

FIL_PAGE_SPACE_OR_CHKSUM
The checksum of the current page. For a very long byte string, an algorithm can compute a relatively short value that represents it; this short value is called the checksum. Comparing checksums is much cheaper than comparing the full byte strings.
FIL_PAGE_OFFSET
Each page has a unique page number, which InnoDB uses to locate the page.
FIL_PAGE_TYPE
The type of the current page. As mentioned earlier, InnoDB divides pages into different types for different purposes.
[Table: page types]
FIL_PAGE_PREV and FIL_PAGE_NEXT
The page numbers of the previous page and the next page. Pages are linked into a doubly linked list through FIL_PAGE_PREV and FIL_PAGE_NEXT.
[Figure: data pages forming a doubly linked list]

File Trailer

The basic unit of interaction between memory and disk in MySQL is the page. When a page in memory is modified, it is synchronized to disk at some point. If the system fails partway through that synchronization, the page on disk may be only partially written. To detect this, InnoDB adds a File Trailer at the end of each page to verify the page's integrity. File Trailer consists of 8 bytes:

The first 4 bytes are the checksum of the page.
This part corresponds to the checksum in the File Header. Put simply, both File Header and File Trailer hold a checksum. If the two match, the page is complete; otherwise, the page was only partially written.
The last 4 bytes are the log sequence number (LSN) of the last modification to the page.
This part is also used to verify the integrity of the page; we will not go into detail here.
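
The idea of comparing the two checksums can be sketched in Python with a toy page layout. Everything about the layout is an assumption made for illustration: a 4-byte checksum, then the payload, then a 4-byte copy of the checksum, with zlib.crc32 standing in for InnoDB's real checksum algorithm. The real File Header and File Trailer contain more fields than this.

```python
import zlib


def make_page(payload: bytes) -> bytes:
    """Toy page: 4-byte checksum | payload | 4-byte checksum copy."""
    checksum = zlib.crc32(payload).to_bytes(4, "big")
    return checksum + payload + checksum


def is_complete(page: bytes) -> bool:
    """A page is complete if both checksums match each other and the payload."""
    head, payload, tail = page[:4], page[4:-4], page[-4:]
    expected = zlib.crc32(payload).to_bytes(4, "big")
    return head == tail == expected


old_page = make_page(b"a" * 100)
new_page = make_page(b"b" * 100)

# Simulate a crash halfway through writing new_page over old_page:
# the front of the page is new, the back is still old.
torn_page = new_page[:50] + old_page[50:]

print(is_complete(new_page))   # True
print(is_complete(torn_page))  # False
```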

Origin blog.csdn.net/qq_40989769/article/details/107841350