[MySQL] Learn about the pages in InnoDB

1. Introduction

I had no plans for the weekend, so I went back through the expert-written booklet "How MySQL Works" and relearned the concept of pages.
Pages are easy to overlook, but any deep dive into MySQL eventually runs into them. Love them or hate them, you end up having to understand them.

2. Look at the page from the macro level

A common way to improve performance under high concurrency is batching: process a large amount of data at once to avoid frequent network round trips and IO.

The MySQL page is based on this idea. The disk is where data is stored, while data processing happens in memory, so the process can be roughly divided into:

  • S1: First, the data is divided into pages
  • S2: Every read from disk loads a whole page into memory
  • S3: Queries from clients then read and operate on the data in memory
  • S4: If data is modified, the changed contents in memory must be flushed back to disk
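
The four steps above can be sketched as a toy page cache. This is only an illustration of the idea, not how InnoDB's Buffer Pool is implemented; the class name TinyBufferPool and its methods are made up for this example.

```python
PAGE_SIZE = 16 * 1024  # bytes per page (InnoDB's default page size)


class TinyBufferPool:
    """Toy model: a 'disk' made of fixed-size pages plus an in-memory cache."""

    def __init__(self, disk: bytearray):
        self.disk = disk        # S1: the data is already divided into pages
        self.cache = {}         # page_no -> bytearray held in memory
        self.dirty = set()      # pages changed in memory but not yet on disk
        self.disk_reads = 0

    def get_page(self, page_no: int) -> bytearray:
        # S2: a cache miss reads one whole page from disk, never a single row
        if page_no not in self.cache:
            start = page_no * PAGE_SIZE
            self.cache[page_no] = bytearray(self.disk[start:start + PAGE_SIZE])
            self.disk_reads += 1
        return self.cache[page_no]

    def read(self, page_no: int, offset: int, length: int) -> bytes:
        # S3: reads are served from the in-memory copy of the page
        return bytes(self.get_page(page_no)[offset:offset + length])

    def write(self, page_no: int, offset: int, data: bytes) -> None:
        page = self.get_page(page_no)
        page[offset:offset + len(data)] = data
        self.dirty.add(page_no)

    def flush(self) -> None:
        # S4: modified pages are written back to disk as whole pages
        for page_no in sorted(self.dirty):
            start = page_no * PAGE_SIZE
            self.disk[start:start + PAGE_SIZE] = self.cache[page_no]
        self.dirty.clear()


pool = TinyBufferPool(bytearray(4 * PAGE_SIZE))
pool.write(2, 100, b"hello")
first = pool.read(2, 100, 5)
second = pool.read(2, 200, 5)  # same page, so no additional disk read
pool.flush()
```

Note that the write and both reads touch page 2 only, so the "disk" is read once even though three operations were performed.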

Benefits of pages

What may be unclear here is why pages were introduced at all, instead of handling everything at the row level.

  • The first thing pages solve is the IO problem. If every query touched only one row per page, there would be no advantage. But large reads usually query contiguous data, so on balance reading whole pages is more efficient.
  • Avoiding fragmentation: rows are too fine-grained and vary in size, so storage space is hard to allocate row by row
  • Concurrency and locking: keeping an operation within a single page limits how much data has to be protected at once
  • Maintainability and generality: fixed-size pages are easier to handle when data has to be reorganized

3. Basic content of the page

The main concepts related to pages and indexes are:

  • Page: the basic unit of data storage, a fixed-size block of data, 16KB by default
  • Row: the basic unit of data in the database, representing one record in a table
  • Group: the records in a page, excluding deleted ones, are logically divided into groups, and the offset of the last record in each group is recorded
  • Slot: the offset of the last record of each group is stored as a pointer in the page directory; that pointer is a slot
  • Page Directory: the data structure used to manage the records within a data page; it holds the slots, that is, the pointers and other location information

3.1 Page Data Structure


  • File Header and Page Header hold the basic properties and status information of the page
  • Infimum and Supremum are virtual records that mark the boundaries of the records in the page; they do not correspond to any real row
  • Infimum stands for a value smaller than any value on the page
  • Supremum stands for a value greater than any value on the page
  • User Records and Free Space are the actual storage area; the free space shrinks as data is inserted
  • The Page Directory stores the relative positions of records and speeds up lookups by acting as a sparse directory
  • The File Trailer protects data integrity: it stores a checksum used to verify that the page contents are correct
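
To get a feel for how much of a page is left for rows, here is a small piece of arithmetic. The byte sizes are the ones commonly given for the classic InnoDB page layout (38-byte File Header, 56-byte Page Header, 26 bytes for Infimum plus Supremum, 8-byte File Trailer, 2 bytes per directory slot); treat them as assumptions to check against your own version. The function name usable_bytes is made up for this sketch.

```python
PAGE_SIZE = 16 * 1024  # default page size in bytes

# Fixed-size parts of a page, in bytes (assumed values, see the text above)
FIXED_PARTS = {
    "File Header": 38,
    "Page Header": 56,
    "Infimum + Supremum": 26,
    "File Trailer": 8,
}
SLOT_BYTES = 2  # each page directory slot is a 2-byte offset


def usable_bytes(n_slots: int, page_size: int = PAGE_SIZE) -> int:
    """Bytes left for User Records and Free Space with n_slots directory slots."""
    return page_size - sum(FIXED_PARTS.values()) - SLOT_BYTES * n_slots


# An empty page still has two slots: one for Infimum and one for Supremum.
empty_page_space = usable_bytes(n_slots=2)
```

The fixed parts are tiny compared with the page, so almost all of the 16KB is available for records and the directory.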

Structural changes brought about by inserting data

3.2 Data row structure in user space


The main fields are:

  • n_owned: the number of records in the group owned by this record; only the last record of a group carries this count, and it gives the size of the group
  • heap_no: the position of the record in the page's heap; the Infimum and Supremum records have heap_no 0 and 1 respectively and sit at the front
  • next_record: the relative position of the next record, which links the records into a linked list

3.3 Page Directory

We have all worked with arrays or collections. There are many ways to search an array, such as scanning forward or backward, or the more efficient binary search.

Premise: MySQL stores data as row records, and within a table the row data is kept in order.

Directory: no matter how good the algorithm is, there is still a significant performance cost when the amount of data is large. To handle this, MySQL uses a directory. The slots and groups in the directory give a condensed model of the data: the right group is located quickly through the condensed data, and the record is then found by walking through that group.

Slots and grouping

Some material says each data row corresponds to one slot, while other material says multiple records share one slot. I go with the latter here, that is, a sparse directory.
The page directory stores the relative positions of records, and each stored position is a slot. InnoDB uses a sparse directory, meaning one slot covers several records (4 to 8 in an ordinary group).


  • The group containing the Infimum record holds exactly 1 record

  • The group containing the Supremum record holds 1 to 8 records

  • Every other group holds 4 to 8 records

  • Lookup principle

    • When querying, first binary-search the slots of the page directory
    • Once the group is located, walk its records through next_record to find the specific row
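
The lookup can be sketched as follows. This is a simplified model, not InnoDB's code: records live in a Python list, next_record is a list index instead of a byte offset, and the names Record, build_page and find are invented for the example. Groups are built with a fixed size of 4, which satisfies the 4 to 8 rule for ordinary groups.

```python
import math
from dataclasses import dataclass


@dataclass
class Record:
    key: float
    n_owned: int = 0       # non-zero only on the last record of a group
    next_record: int = -1  # index of the next record, -1 at the end


def build_page(keys, group_size=4):
    """Return (records, slots) for a page holding the given user keys."""
    records = [Record(-math.inf, n_owned=1)]         # Infimum, a group of its own
    records += [Record(k) for k in sorted(keys)]
    records.append(Record(math.inf))                 # Supremum
    for i in range(len(records) - 1):
        records[i].next_record = i + 1               # singly linked, key order

    slots = [0]                                      # slot 0 points at Infimum
    last = len(records) - 1
    start = 1
    while start <= last:
        end = min(start + group_size - 1, last)      # last record of this group
        records[end].n_owned = end - start + 1
        slots.append(end)                            # the slot points at it
        start = end + 1
    return records, slots


def find(records, slots, key):
    """Binary-search the slots, then walk next_record inside the group."""
    lo, hi = 0, len(slots) - 1
    while lo < hi:                                   # first slot with key >= target
        mid = (lo + hi) // 2
        if records[slots[mid]].key < key:
            lo = mid + 1
        else:
            hi = mid
    if lo == 0:
        return None
    i = records[slots[lo - 1]].next_record           # first record of the group
    while i != -1 and records[i].key < key:
        i = records[i].next_record
    return i if i != -1 and records[i].key == key else None


records, slots = build_page(range(10, 210, 10))      # 20 user records
hit = find(records, slots, 70)
miss = find(records, slots, 75)
```

With 20 user records the directory has only 7 slots, so the binary search runs over 7 entries and the final walk touches at most one group.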

4. Problem set

4.1 What is the difference between an index and a data page

  • The two are not the same thing: the data they store and their structure differ
  • In an index, each B+ tree node corresponds to one page; a non-leaf index page stores index key values and pointers (page numbers) to pages on the next level
  • A query starts at the root page and descends the tree level by level until it reaches a leaf page
  • In the clustered index the leaf pages are the data pages themselves, so the row is then found inside the leaf page through its page directory (slots); a secondary index leaf stores the primary key value instead, which requires a second lookup in the clustered index

In addition to these pages, InnoDB also has pages that store tablespace header information, Insert Buffer pages, and so on.

4.2 What determines the page size

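  • The page size is determined by the system variable innodb_page_size, which is set when the MySQL instance (its data directory) is initialized, not per table
  • Once set, it cannot be changed for that instance; otherwise every existing page would have to be rewritten

A single table can still use smaller compressed pages. The statement below sets the compressed page size of one table to 8KB through KEY_BLOCK_SIZE; it does not change innodb_page_size:

CREATE TABLE my_table (...) ENGINE = InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;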

4.3 What does page size affect

  • Indexing efficiency: with the same amount of data, larger pages hold more records each, so fewer pages and fewer index nodes are needed. The tree becomes shallower as a result, and a lookup needs fewer page reads
  • Memory footprint: larger pages take up more memory, because a whole page is loaded on every read even when only a few rows are needed
  • Other hardware impact: larger pages move more data per disk IO and cost more CPU per page, which puts more pressure on the IO subsystem

Summary: larger pages can improve lookup efficiency, but they increase system load.
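
The effect on tree height can be estimated with rough arithmetic. The numbers below are assumptions chosen for illustration only: 200 bytes per row, 14 bytes per index entry (key plus child page number), and 128 bytes of fixed overhead per page. The function btree_height is a made-up helper, not a MySQL API.

```python
import math


def btree_height(n_rows: int, page_size: int,
                 row_bytes: int = 200, entry_bytes: int = 14,
                 overhead: int = 128) -> int:
    """Estimate the number of B+ tree levels needed to index n_rows rows."""
    usable = page_size - overhead
    rows_per_leaf = usable // row_bytes   # rows that fit in one leaf page
    fanout = usable // entry_bytes        # child pointers per non-leaf page
    pages = math.ceil(n_rows / rows_per_leaf)
    height = 1
    while pages > 1:                      # add levels until one root remains
        pages = math.ceil(pages / fanout)
        height += 1
    return height


heights = {size: btree_height(100_000_000, size)
           for size in (4 * 1024, 16 * 1024, 64 * 1024)}
```

Under these assumptions, 100 million rows need one more level with 4KB pages than with 16KB pages, which means one more page read per lookup. Going from 16KB to 64KB does not remove another level here, which shows that the gain is not unlimited.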

4.4 What linked lists are involved?

A singly linked list formed by next_record between the rows within a page

As mentioned above, every data row has a next_record field, which records the offset from the real data of this record to the real data of the next record. A few points are worth noting:

  • The order is not insertion order but ascending primary key order
  • Each record points to the start of the next record's real data, not to the start of its record header

A doubly linked list between data pages

As described in section 3.1, every page contains two headers, the File Header and the Page Header.

  • Page Header: records status information about the current page, such as the number of slots, the number of records, and the amount of free space
  • File Header: records general information about the page, including the page number, the tablespace it belongs to, and the page numbers of the previous and next pages

How the doubly linked list is formed is then self-evident: knowing the page numbers of the previous page (FIL_PAGE_PREV) and the next page (FIL_PAGE_NEXT) is enough to move in either direction. Since each page stores only its previous and next page, the result is a standard doubly linked list.
Supplement: other doubly linked lists exist as well, such as the LRU list and the Flush List in the Buffer Pool. The Flush List is used to write modified data pages back to disk in a certain order.

4.5 What to do if the page is full

  • First, the page size is fixed when the instance is initialized, so the space in each page is fixed.
  • Second, the data in a page is sorted by primary key, so a new row has to go to the position its key dictates, even if that page has no room left

In this scenario, a page split is triggered, and InnoDB does the following:

  • S1: Create a new data page
  • S2: Move part of the records to the new page, keeping the sort order
  • S3: Update the previous/next links of the affected pages and the corresponding index entries

Because the pages are linked in a doubly linked list, the insertion does not disturb the rest of the structure much; only the links of the neighboring pages need to be updated.
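
A page split can be sketched with a toy chain of pages. This is a simplification under several assumptions: a page holds at most 4 keys, the split moves the upper half of the keys, the key is inserted first and the split happens afterwards, and the parent index update from S3 is left out. The names Page, new_page and insert are invented for the example.

```python
import bisect
import itertools


class Page:
    def __init__(self, page_no: int, capacity: int = 4):
        self.page_no = page_no
        self.capacity = capacity
        self.keys = []      # kept sorted, like rows ordered by primary key
        self.prev = None    # plays the role of FIL_PAGE_PREV
        self.next = None    # plays the role of FIL_PAGE_NEXT


_page_numbers = itertools.count(1)


def new_page() -> Page:
    return Page(next(_page_numbers))


def insert(head: Page, key: int) -> None:
    """Insert key into the page chain, splitting the target page if it overflows."""
    page = head
    # Find the page whose key range covers the new key
    while page.next is not None and key >= page.next.keys[0]:
        page = page.next
    bisect.insort(page.keys, key)
    if len(page.keys) > page.capacity:
        right = new_page()                                     # S1: new page
        mid = len(page.keys) // 2
        right.keys, page.keys = page.keys[mid:], page.keys[:mid]  # S2: move rows
        right.prev, right.next = page, page.next               # S3: relink pages
        if page.next is not None:
            page.next.prev = right
        page.next = right


head = new_page()
for k in [10, 20, 30, 40, 25]:
    insert(head, k)
```

Inserting 25 into the full page [10, 20, 30, 40] leaves [10, 20] on the original page and [25, 30, 40] on the new one, and only the two neighboring links change.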

4.6 What happens when pages become mostly empty

Because pages split, and splits can be unbalanced, many partly empty pages accumulate over time. This is wasteful too: it takes up unnecessary space and also degrades query performance.

To avoid these problems, InnoDB supports page merging, which works on a principle similar to the above: adjacent pages are merged when possible, and the links and index entries are then updated.

4.7 When will deleted data be cleaned up

As seen earlier, after a row is deleted, the delete_mask in its record header is set to mark it as deleted.

At this point the row is only logically deleted: it stays physically on the page, while the next_record pointer of the preceding record is adjusted to skip it and point to the next live record.

The purpose is mainly to avoid fragmentation and make deletion cheap (only a flag and a pointer are modified), while still supporting transactional deletes.

But in the long run a large number of deleted records takes up space. To avoid this, InnoDB cleans them up in the background and reorganizes the data pages at the same time.

4.8 Relationship between data pages, B+ trees and indexes

  • Data pages store the data rows in binary form. Rows are normally stored in primary key order

  • The B+ tree is the data structure behind the index; it makes the index efficient to search and easy to maintain

  • In the clustered index, the leaf nodes of the B+ tree are the data pages themselves and hold the complete rows; non-leaf nodes hold key values and the page numbers of their child pages

    • Once the leaf page is reached, the row is located inside it through the page directory: a binary search over the slots, then a walk along next_record

Pages and indexes complement each other. Without an index, the pages have to be scanned one by one along the linked list until the matching data is found.

Summary

Pages are the foundation of storage and also of indexes. Once you understand pages, indexes become much easier to understand in depth.

I have not gone very deep into this topic; after all, I have almost no practical use for it. The main reason for studying it is that later material is uncomfortable to read without it.

I tried to write this up in my own words and sort out some questions, but since I am standing on a road paved by others, I cannot guarantee that everything is correct, and I may have misunderstood some points. If anything is in doubt, I suggest reading the original book or the official documentation.

Appendix

The header information is rarely useful in everyday work. Here are only the fields that I think relate to the content above:

  • Page Header fields

    • PAGE_N_DIR_SLOTS : number of slots in the page directory
    • PAGE_N_HEAP : number of records in this page, including Infimum, Supremum and records marked as deleted
    • PAGE_GARBAGE : number of bytes occupied by deleted records
    • PAGE_LAST_INSERT : position of the most recently inserted record
    • PAGE_DIRECTION : direction in which records are being inserted
    • PAGE_N_RECS : number of user records in this page, excluding Infimum, Supremum and records marked as deleted
    • PAGE_LEVEL : level of the current page in the B+ tree
    • PAGE_INDEX_ID : ID of the index the page belongs to
  • File Header fields

    • FIL_PAGE_OFFSET : page number
    • FIL_PAGE_PREV : the page number of the previous page
    • FIL_PAGE_NEXT : the page number of the next page
    • FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID : which tablespace the page belongs to
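
As a rough illustration of where these File Header fields sit, the sketch below decodes the first 38 bytes of a page with Python's struct module. The field order and widths follow the classic InnoDB File Header layout (big-endian: 4-byte checksum, page number, previous page, next page, 8-byte LSN, 2-byte page type, 8-byte flush LSN, 4-byte space ID); treat that layout as an assumption and verify it for your version. The page here is synthetic, built in memory, and parse_file_header is a made-up helper name.

```python
import struct

# Big-endian, no padding: 4+4+4+4+8+2+8+4 = 38 bytes
FIL_HEADER = struct.Struct(">IIIIQHQI")
FIL_NULL = 0xFFFFFFFF       # marker meaning "no previous/next page"
FIL_PAGE_INDEX = 17855      # page type of a B+ tree index page


def parse_file_header(page: bytes) -> dict:
    """Decode the File Header at the start of a page."""
    (checksum, page_no, prev_page, next_page,
     lsn, page_type, flush_lsn, space_id) = FIL_HEADER.unpack_from(page, 0)
    return {
        "FIL_PAGE_OFFSET": page_no,
        "FIL_PAGE_PREV": None if prev_page == FIL_NULL else prev_page,
        "FIL_PAGE_NEXT": None if next_page == FIL_NULL else next_page,
        "FIL_PAGE_TYPE": page_type,
        "FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID": space_id,
    }


# Synthetic 16KB page: page 4 of tablespace 7, no previous page, next page 5
page = bytearray(16 * 1024)
FIL_HEADER.pack_into(page, 0, 0, 4, FIL_NULL, 5, 0, FIL_PAGE_INDEX, 0, 7)
header = parse_file_header(bytes(page))
```

Following FIL_PAGE_NEXT from page to page is how the leaf level of an index can be scanned in key order.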

Reference documents

  • Booklet: How MySQL Works

  • Inside MySQL Technology


Origin blog.csdn.net/u011397981/article/details/132394956