How is a MySQL record stored?

This mainly depends on MySQL's default storage engine, InnoDB.

Every time a database is created, a directory named after the database is created under /var/lib/mysql/.

For a table such as t_order, that directory contains the following three files:

  • db.opt: stores the default character set and collation rules of the current database. (database-level metadata)
  • t_order.frm: the table structure of t_order is saved in this file. Creating a table in MySQL generates a .frm file, which holds each table's metadata, mainly the table structure definition. (table structure)
  • t_order.ibd: the table data of t_order is saved in this file. Table data can be stored either in the shared tablespace file (ibdata1) or in a file-per-table tablespace file (tablename.ibd); this is controlled by the innodb_file_per_table parameter. If innodb_file_per_table is set to 1, each table's data, indexes and other information are stored in its own tablespace file. Its default value is 1 starting from MySQL 5.6.6, so from that version on, each table's data is stored in a separate .ibd file; the check below confirms the setting. (row data)
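A quick sanity check from a client (a sketch, assuming you connect with sufficient privileges):

-- ON means each table gets its own .ibd file (default since MySQL 5.6.6)
SHOW VARIABLES LIKE 'innodb_file_per_table';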

So, to understand how a row of data is stored, we need to analyze this .ibd file (the file-per-table tablespace file).

Tablespace structure:

A tablespace is organized hierarchically: segments are made of extents, extents are made of pages, and pages hold rows. Let's look at each layer, from small to large.

Row:

The records in a table are stored in rows, and each row has a different storage layout depending on its row format.

Page:

Records are stored in rows, but the database does not read data in units of rows; otherwise one read (one I/O operation) could only process a single row, which would be very inefficient. (The storage unit is the row; the read unit is the page.)

Therefore, InnoDB reads and writes data in units of pages. When a record needs to be read, InnoDB does not fetch just that row from disk; it reads the entire page containing the record into memory.

The default size of each page is 16KB; that is, each page guarantees 16KB of contiguous storage space.

The page is the smallest unit of disk management in the InnoDB storage engine, which means every database read and write happens in 16KB units: at least 16KB is read from disk into memory at a time, and at least 16KB of memory is flushed to disk at a time.

There are many types of pages; common ones include data pages, undo log pages, and overflow pages. The row records of a table are managed by data pages.
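The page size can be confirmed from a client:

-- 16384 bytes (16KB) by default
SHOW VARIABLES LIKE 'innodb_page_size';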

Extent:

We know that the InnoDB storage engine uses B+ trees to organize data.

Each level of the B+ tree is connected by a doubly linked list. If storage space were allocated page by page, two pages that are adjacent in the linked list might not be physically contiguous on disk, and could even be very far apart. Range queries would then trigger a lot of random I/O, and random I/O is very slow.

The solution is simple: make pages that are adjacent in the linked list physically adjacent as well, so that sequential I/O can be used. Range queries (scanning the leaf nodes) then perform very well.

How to solve it?

When a table grows large, space for an index is no longer allocated page by page but extent by extent. Each extent is 1MB; with 16KB pages, 64 consecutive pages form one extent. Pages that are adjacent in the linked list are then also physically adjacent, and sequential I/O becomes possible.
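The arithmetic can be sanity-checked from a client (assuming a recent MySQL where @@innodb_page_size is readable):

-- 64 consecutive 16KB pages form one 1MB extent
SELECT @@innodb_page_size * 64 / 1024 / 1024 AS extent_mb;   -- 1.0000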

Segment:

A tablespace is made up of segments, and a segment is made up of multiple extents. Segments generally fall into data segments, index segments, and rollback segments.

  • Index segment: the set of extents that store the non-leaf nodes of the B+ tree;
  • Data segment: the set of extents that store the leaf nodes of the B+ tree;
  • Rollback segment: the set of extents that store rollback (undo) data.
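For a rough look at how tablespaces are exposed, MySQL 8.0's information_schema offers a tablespace view (the database name 'mydb' below is hypothetical):

-- List the file-per-table tablespaces of one database (MySQL 8.0)
SELECT name, row_format, page_size, space_type
  FROM information_schema.INNODB_TABLESPACES
 WHERE name LIKE 'mydb/%';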

InnoDB row format

InnoDB provides 4 row formats: Redundant, Compact, Dynamic, and Compressed.

  • Redundant is a very old row format, used before MySQL 5.0; it is basically unused now.
  • Because Redundant is not a compact format, MySQL 5.0 introduced the Compact row format. Compact is a compact row format, designed so that more records fit in a single data page; it became the default from MySQL 5.1 on.
  • Dynamic and Compressed are also compact row formats, similar to Compact with some improvements built on top of it. From MySQL 5.7 on, Dynamic is the default row format (verified in the check below).
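You can check the default and what a given table actually uses (assuming the t_order table from earlier exists in the current database):

-- The default row format for new tables (DYNAMIC from MySQL 5.7 on)
SHOW VARIABLES LIKE 'innodb_default_row_format';

-- The Row_format column shows the format a table actually uses
SHOW TABLE STATUS LIKE 't_order';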

So we will study Compact directly.

The Compact row format divides a record into two parts: the record's extra information and the record's real data. The extra information in turn has three parts: the variable-length field length list, the NULL value list, and the record header information. Let's start with the extra information.

Record extra information

Variable-length field length list

Take the commonly used VARCHAR type as an example. VARCHAR is a variable-length field, so when we store a value we also store its length; that way we know how many bytes to read back. The same holds for other variable-length types such as TEXT and BLOB.

CREATE TABLE `user` (
  `id` int(11) NOT NULL,
  `name` VARCHAR(20) DEFAULT NULL,
  `phone` VARCHAR(20) DEFAULT NULL,
  `age` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB DEFAULT CHARACTER SET = ascii ROW_FORMAT = COMPACT;

We created this table. Note that the storage engine is set to InnoDB (not strictly necessary, since it is the default), the character set is ascii (so one character takes one byte), and the row format is COMPACT.

INSERT INTO user VALUES (1,'name','phone',18);

We insert this row of data.

Both name and phone are VARCHAR, i.e. variable-length types.

The length of 'name' is four bytes, 0x04 in hexadecimal (as just mentioned, the lengths of variable-length fields are stored in this list, so this 0x04 is what gets stored).

The length of 'phone' is five bytes, 0x05 in hexadecimal.

We can compute it this way because we are using the ascii character set. What about utf8? UTF-8 is a variable-length character set, so under it even a CHAR column's length has to be recorded in the variable-length field length list.

The byte counts of the variable-length fields' real data are stored in reverse column order.

So what is stored is 05 04, not 04 05.

But why can't the order be left alone? Why does it have to be reversed?

The reason for storing in reverse order 

Some context first: the next_record pointer in the record header points to the position between the next record's header information and its real data. The benefit is that, from that position, reading to the left gives the record header information and reading to the right gives the real data, which is convenient.

The reason the variable-length field length list is stored in reverse order is that a record's early fields' real data and their corresponding length bytes can then sit in the same CPU cache line at the same time, which improves the CPU cache hit rate.

For the same reason, the information of the NULL value list also needs to be stored in reverse order.

The CPU cache is very small, and for two pieces of data to land in one cache line they must be close together in memory. With reverse-order storage, take name as an example: its length byte, stored toward the end of the variable-length field length list, ends up closer to its real data (the value of the first column).

NULL value list

Some columns in a table may hold NULL. If these NULLs were stored inside the record's real data, it would waste space, so the Compact row format tracks NULL-valued columns in a NULL value list instead.

If the table has columns that allow NULL, each such column gets one binary bit, and the bits are arranged in reverse column order.

  • When a bit is 1, that column's value is NULL.
  • When a bit is 0, that column's value is not NULL.

In addition, the NULL value list must occupy a whole number of bytes (1 byte = 8 bits). If the number of bits used is not a multiple of 8, the high bits of the byte are padded with 0.

Let's still use the row of data just now as an example:

Because every column in this row has a value, none is NULL, so the bits are all 0. And since the id column is declared NOT NULL, it gets no bit in the NULL value list at all.

Still, the list must occupy a whole number of bytes. We only have 3 bits so far (name, phone, age), so 5 high-order 0 bits are padded on to make 8.

So the byte actually stored is 00000000, i.e. 0x00.

The NULL value list is not always present, either:

When every column in the table is defined NOT NULL, the rows of that table have no NULL value list at all.

Therefore, when designing a table it is usually recommended to declare columns NOT NULL; this saves at least 1 byte of space (the NULL value list occupies at least 1 byte).

Note that it is at least 1 byte, not at most 1 byte: if I define nine columns that all allow NULL, those 9 bits no longer fit in one byte, so the list grows to 2 bytes.

Record header information

The record header contains quite a few fields; here are a few important ones.

  • delete_mask: indicates whether this record is deleted. From this we learn that executing DELETE does not physically remove a record right away; it just sets the record's delete_mask to 1.
  • next_record: the position of the next record, which tells us records are chained into a linked list. As mentioned earlier, it points to the position between the next record's header information and its real data, so reading left gives the header and reading right gives the real data.
  • record_type: the type of the current record: 0 for an ordinary record, 1 for a B+ tree non-leaf node record, 2 for the minimum record, and 3 for the maximum record.

Record real data

Besides the columns we define, the real-data part of a record contains three hidden fields: row_id, trx_id, and roll_pointer.

  • row_id

If we specify a primary key or a unique constraint column when creating the table, there is no hidden row_id field. If neither a primary key nor a unique constraint is specified, InnoDB adds a hidden row_id to the record. So row_id is optional; when present it occupies 6 bytes.

  • trx_id

Transaction id, indicating which transaction generated this record. trx_id is required and occupies 6 bytes.

  • roll_pointer

A pointer to the previous version of this record (in the undo log). roll_pointer is required and occupies 7 bytes.

What is the maximum value of n in varchar(n)?

MySQL stipulates that, excluding large object types such as TEXT and BLOB, the total bytes occupied by all other columns (not counting hidden columns and record header information) cannot exceed 65535 bytes.

In other words, leaving TEXT and BLOB columns aside, a single row of a record is limited to 65535 bytes.

Note that the n in varchar(n) counts characters, not bytes.

For example, under the ascii character set 1 character takes 1 byte, while under MySQL's utf8 (i.e. utf8mb3) a character can take up to 3 bytes.
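The characters-versus-bytes distinction is easy to see from a client: CHAR_LENGTH counts characters while LENGTH counts bytes.

-- A CJK character is 1 character but 3 bytes in UTF-8
SELECT CHAR_LENGTH(_utf8mb4'名') AS char_count,
       LENGTH(_utf8mb4'名')      AS byte_count;   -- 1, 3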

The case of a single field

Suppose the table has only one column, of type varchar(n), with the ascii character set. Is the maximum value of n then 65535?

No. As the row structure above showed, a stored row carries extra information besides the real data; what we store is:

  • the real data;
  • the number of bytes the real data occupies (the variable-length field length list);
  • the NULL flag; if the column does not allow NULL, this part is not needed.

If the column allows NULL, a 1-byte NULL value list is needed.

Next, how many bytes does the "variable-length field length" of each variable-length field itself occupy? There are two cases:

  • Case 1: if the maximum number of bytes the variable-length field is allowed to store is less than or equal to 255, 1 byte is used for the field length;
  • Case 2: if the maximum number of bytes it is allowed to store is greater than 255, 2 bytes are used for the field length.
  • (What is compared against 255 is the maximum bytes per character multiplied by the n in the parentheses.)

Here the column type is varchar(65535) with the ascii character set, meaning the field may store up to 65535 bytes. That falls under the second case, so 2 bytes are used for the variable-length field length.

So the largest possible n is 65535 - 1 (NULL value list) - 2 (field length bytes) = 65532.

Of course, the example above assumes the ascii character set. With UTF-8, the maximum data varchar(n) can hold is computed differently:

  • Under MySQL's utf8 character set a character needs at most three bytes, so the maximum n in varchar(n) is 65532 / 3 = 21844 (demonstrated below).
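A sketch of the ceiling in practice (single nullable ascii column; the exact error text may vary by version):

-- 65532 is accepted:
CREATE TABLE t_max (`c` VARCHAR(65532) DEFAULT NULL)
  ENGINE = InnoDB DEFAULT CHARACTER SET = ascii ROW_FORMAT = COMPACT;

-- ...but 65533 pushes the row past 65535 bytes and fails, roughly with:
-- ERROR 1118 (42000): Row size too large. The maximum row size for the
-- used table type, not counting BLOBs, is 65535.
CREATE TABLE t_over (`c` VARCHAR(65533) DEFAULT NULL)
  ENGINE = InnoDB DEFAULT CHARACTER SET = ascii ROW_FORMAT = COMPACT;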

The above applies to calculating a single field.

My understanding: utf8 needs at most three bytes per character, and twenty-odd thousand characters will not all actually take 3 bytes; whatever space is not needed is simply not occupied, since varchar is variable-length.

How does MySQL handle row overflow?

The basic unit of exchange between disk and memory in MySQL is the page, generally 16KB (16384 bytes). A varchar(n) column can store up to 65532 bytes, and large object types such as TEXT and BLOB may store even more, so sometimes one page cannot hold a whole record. In that case a row overflow occurs, and the excess data is stored in separate "overflow pages".

If a data page cannot hold a record, the InnoDB storage engine automatically places the overflowing data in overflow pages. In general InnoDB data lives in data pages; only when a row overflow happens does the overflowing part move to overflow pages.

With the Compact format, when a row overflow occurs only part of the column's data (a 768-byte prefix) is kept in the record's real data; the rest goes to the overflow page, and the real data additionally stores a 20-byte pointer holding the overflow page's address, through which the remaining data can be found.

The above is the processing of the Compact row format after row overflow occurs.

The Compressed and Dynamic row formats are very similar to Compact; the main difference is in how they handle row overflow.

These two formats use a full row-overflow approach: none of the column's data is kept in the record's real data, only a 20-byte pointer to the overflow page, and all of the actual data lives in the overflow pages.
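A small sketch contrasting the two behaviors (table and column names are made up; the storage difference is internal and not visible in query results):

-- COMPACT keeps a 768-byte prefix of an overflowing column in the record,
-- plus a 20-byte pointer to the overflow pages
CREATE TABLE doc_compact (id INT PRIMARY KEY, body TEXT)
  ENGINE = InnoDB ROW_FORMAT = COMPACT;

-- DYNAMIC keeps only the 20-byte pointer in the record; the whole column
-- value lives in the overflow pages
CREATE TABLE doc_dynamic (id INT PRIMARY KEY, body TEXT)
  ENGINE = InnoDB ROW_FORMAT = DYNAMIC;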

To recap: records are stored in rows, but the database reads in units of pages rather than rows; otherwise one read (one I/O operation) could only process a single row, which would be very inefficient. The default size of an InnoDB data page is 16KB, which means every read or write moves at least 16KB from disk into memory, or flushes at least 16KB from memory to disk.

Data page structure

File Header

The File Header holds two pointers, one to the previous data page and one to the next; the linked pages thus form a doubly linked list.

Using a linked list means data pages do not need to be physically contiguous, only logically contiguous.

 User Records

Records inside a data page form a singly linked list in primary key order. A singly linked list is convenient for insertion and deletion, but poor for retrieval: in the worst case every node on the list must be traversed.

Therefore a data page also contains a page directory, which acts as an index over the records. It is like a book's table of contents: to read a chapter you look up its page number in the contents instead of flipping through every page; the page directory lets us locate records just as quickly.

The process of page directory creation is as follows:

  1. All records are divided into several groups; these include the minimum and maximum records, but not records marked as "deleted";
  2. The last record of each group is the largest in that group, and its header stores the number of records in the group in the n_owned field;
  3. The page directory stores the address offset of each group's last record; these offsets are stored in order, and each one is called a slot. Each slot is effectively a pointer to a different group's last record.

The page directory is thus composed of slots, and slots act as an index over the record groups. Since records are sorted by primary key from small to large, when searching via slots we can use binary search to quickly locate which slot (which record group) the target record is in; after locating the slot, we traverse only the records within that group, instead of walking the whole page's record list starting from the smallest record.

As an example, suppose there are 5 slots numbered 0, 1, 2, 3, 4 and we want to find the user record whose primary key is 11:

  • First binary-search: the middle slot is (0+4)/2 = 2, and the largest record in slot 2 has primary key 8. Since 11 > 8, continue searching in the slots after slot 2;
  • Binary-search again between slots 2 and 4: the midpoint is (2+4)/2 = 3, and the largest record in slot 3 has primary key 12. Since 11 < 12, the record with primary key 11 lies in slot 3;
  • One question remains: a slot points at the largest record of its group, so how do we find the group's smallest record? For example, slot 3's slot record has primary key 12; how do we reach the group's smallest record, 9? The answer: go back to slot 2's record (primary key 8); the record after it in the linked list, primary key 9, is the smallest record in slot 3's group. From there walk forward twice to reach the record with primary key 11 and read out its data.

Limits on records per group

  • The first group (the one containing the minimum record) can hold only 1 record;
  • The last group can hold 1 to 8 records;
  • The remaining groups can hold 4 to 8 records.

Now let's look at how a B+ tree quickly finds the record with primary key 6:

The search first uses binary search among the directory pages to locate the page containing the record; after locating the page, it binary-searches the page directory to locate the record's group (slot number), and finally traverses within the group to find the record.

  • Above we discussed record retrieval within a single data page: a page holds a limited number of records, and their primary keys are ordered, so by grouping all the records and storing each group's position (its slot) in the page directory, the directory acts as an index, and binary search can quickly find which group a record is in, reducing retrieval time.
  • However, when we need to store a large number of records, multiple data pages are required, and we then have to consider how to build a suitable index to locate the page where a record lives.

    To solve this problem, InnoDB uses the B+ tree as its index. The number of disk I/O operations is crucial to index efficiency, so when building an index we prefer the "short and fat" B+ tree data structure, which needs fewer disk I/Os; the B+ tree is also well suited to range queries over keys.

    Each node of the B+ tree in InnoDB is a data page.

    From this structure we can see the characteristics of the B+ tree, and how the lookup of primary key 6 proceeds:

  • Only leaf nodes (the bottom level) store data; non-leaf nodes (the levels above) store only directory entries that serve as an index.
  • Non-leaf nodes are divided into levels, and each level narrows the search range;
  • Nodes at the same level are sorted by index key and linked into a doubly linked list, which is convenient for range queries;
  • Starting from the root node, binary search quickly locates the page whose key range contains the queried value. The queried primary key is 6, which falls in the range [1, 7), so we descend to page 30 for a more detailed directory;
  • In the non-leaf node (page 30), binary search again locates the page whose range covers the value; the primary key is greater than 5, so we descend to the leaf node, page 16, to look for the record;
  • Finally, in the leaf node (page 16), binary search over the slots locates which slot (which record group) the record should be in, and traversing the records in that group finds the record with primary key 6.

 

Clustered Indexes and Secondary Indexes

Indexes are further divided into clustered indexes and non-clustered (secondary) indexes. The difference lies in what their leaf nodes store:

  • The leaf nodes of a clustered index store the actual data: complete user records live in the clustered index's leaf nodes;
  • The leaf nodes of a secondary index store primary key values, not the actual data.

Because the table's data lives in the leaf nodes of the clustered index, InnoDB always creates a clustered index for a table; and since the complete records are physically stored in only one place, there can be only one clustered index.

When InnoDB creates a clustered index, it selects different columns as indexes according to different scenarios:

  • if the table has a primary key, the primary key is used as the clustered index key by default;
  • if there is no primary key, the first unique column containing no NULL values is chosen as the clustered index key;
  • failing both, InnoDB automatically generates the implicit auto-incrementing row_id column as the clustered index key.

A table can have only one clustered index, so to support fast lookups on non-primary-key columns, secondary indexes (non-clustered/auxiliary indexes) are introduced. They also use the B+ tree structure, but a secondary index's leaf nodes store primary key values rather than the actual data.

In the B+ tree of a secondary index, the "data" part of each leaf entry is a primary key value.

Therefore, if a query uses a secondary index but asks for columns beyond the primary key, then after the primary key value is found in the secondary index, the full row must still be fetched from the clustered index. This step is called "going back to the table": two B+ trees are searched to get the data. If, however, the query only needs the primary key value (or the indexed column itself), the secondary index alone can answer it and no clustered-index lookup is needed; this is called a "covering index", and only one B+ tree is searched.
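A small sketch with the user table from earlier (idx_age is a hypothetical secondary index; for the covering case, EXPLAIN's Extra column typically shows 'Using index'):

ALTER TABLE `user` ADD INDEX idx_age (age);

-- Asks for name and phone too: after the idx_age lookup, MySQL must go
-- back to the clustered index for the full row (a table lookup)
EXPLAIN SELECT * FROM `user` WHERE age = 18;

-- Needs only age and id (the primary key, stored in the secondary index),
-- so the secondary index alone answers it: a covering index
EXPLAIN SELECT id, age FROM `user` WHERE age = 18;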

 

 

Origin blog.csdn.net/chara9885/article/details/131547647