Understanding the MySQL redo log

Foreword

This article and the next few rely heavily on background knowledge such as the InnoDB record row format, the page format, how indexes work, and the composition of tablespaces. If you haven't understood these thoroughly, the following text may be hard going, so please make sure you have mastered the material covered earlier.

1. What is the redo log?

We know that the InnoDB storage engine manages storage space in units of pages, and that the inserts, deletes, updates, and queries we perform are really page accesses (reading pages, writing pages, creating new pages, and so on). When we covered the Buffer Pool, we said that a page on disk must be cached in the in-memory Buffer Pool before it can be accessed. But when we covered transactions, we emphasized a property called durability: for a committed transaction, the changes it made to the database must not be lost, even if the system crashes after the commit. If we only modify pages in the in-memory Buffer Pool and a failure wipes out the data in memory right after the transaction commits, the changes made by that committed transaction are lost too, which is something we can't tolerate. So how do we guarantee durability? A very simple approach is to flush all pages modified by the transaction to disk before the transaction commits, but this crude approach has some problems:

  • Flushing a complete data page is too wasteful.
    Sometimes we modify only one byte in a page, but InnoDB performs disk IO in units of pages, which means that when the transaction commits we have to flush a complete page from memory to disk. A page is 16KB by default, so flushing 16KB of data to disk after modifying a single byte is obviously too wasteful.

  • Random IO is slow.
    A transaction may contain many statements, and even a single statement may modify many pages. Unfortunately, the pages modified by a transaction may not be adjacent, which means that flushing the Buffer Pool pages modified by a transaction to disk requires a lot of random IO. Random IO is slower than sequential IO, especially on traditional mechanical hard disks.

What should we do? Go back to our original intention: we only want the changes made by a committed transaction to take effect permanently, so that even if the system crashes later, those changes can be restored after restart. So we don't actually need to flush every in-memory page modified by the transaction to disk each time a transaction commits. We only need to record what was modified. For example, if a transaction changes the byte at offset 1000 of page 100 in the system tablespace from 1 to 2, we only need to record:

Update the value at offset 1000 of page 100 in tablespace 0 to 2.

When the transaction commits, we flush the content above to disk. Even if the system crashes later, after restart we only need to update the data page again by following the recorded steps, and the changes made by the transaction are restored, which satisfies the durability requirement. Because the data page is updated again according to the recorded steps when the system restarts after a crash, this content is called the redo log. Compared with flushing all modified in-memory pages to disk at commit, flushing only the redo logs generated during the transaction has the following benefits:

  • Redo logs take up very little space.
    The space needed to store the tablespace ID, page number, offset, and the value to be written is very small. We will cover the redo log format in detail later; for now it is enough to know that a redo log does not take up much space.
  • Redo logs are written to disk sequentially.
    While a transaction executes, each statement may generate several redo logs. These logs are written to disk in the order they are generated, that is, with sequential IO.
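To make the idea concrete, here is a minimal Python sketch of recording and replaying the example change above. The fixed-width record layout and the function names `make_redo` and `apply_redo` are assumptions made for illustration only; they are not InnoDB's real format or API.

```python
import struct

PAGE_SIZE = 16 * 1024  # default InnoDB page size mentioned above

# Hypothetical layout for this sketch only:
# space ID (4 bytes), page number (4 bytes), offset (2 bytes), new value (1 byte).
RECORD = struct.Struct(">IIHB")


def make_redo(space_id, page_no, offset, value):
    """Describe a one-byte change instead of copying the whole page."""
    return RECORD.pack(space_id, page_no, offset, value)


def apply_redo(pages, record):
    """Replay a recorded change against the on-disk copy of the page."""
    space_id, page_no, offset, value = RECORD.unpack(record)
    page = pages.setdefault((space_id, page_no), bytearray(PAGE_SIZE))
    page[offset] = value


# "Update the value at offset 1000 of page 100 in tablespace 0 to 2."
redo = make_redo(0, 100, 1000, 2)

# Simulated crash: the page on disk still holds the old value 1.
pages_on_disk = {(0, 100): bytearray(PAGE_SIZE)}
pages_on_disk[(0, 100)][1000] = 1

# Recovery: replay the redo record.
apply_redo(pages_on_disk, redo)
```

In this sketch the record is 11 bytes long, compared with 16384 bytes for the full page, which is the space saving the first bullet describes.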

2. redo log format

From the above, we know that a redo log essentially just records what modifications a transaction made to the database. InnoDB defines multiple redo log types for the different ways a transaction can modify the database, but most types share the following general structure:

[Figure: general structure of a redo log: type, space ID, page number, data]
The parts are explained below:

  • type: the type of the redo log. InnoDB defines 53 different redo log types in total, which are introduced in more detail later.
  • space ID: the tablespace ID.
  • page number: the page number.
  • data: the specific content of the redo log.

2.1 Simple redo log type

As we mentioned when covering the InnoDB record row format, if a table has no explicitly defined primary key and no unique key, InnoDB automatically adds a hidden column named row_id to the table as the primary key. Values are assigned to this hidden row_id column as follows:

  • The server maintains a global variable in memory. Whenever a record is inserted into a table that contains a hidden row_id column, the value of the variable is used as the row_id of the new record, and the variable is incremented by 1.
  • Whenever the value of this variable is a multiple of 256, it is flushed to an attribute called Max Row ID in page number 7 of the system tablespace (covered in detail when we introduced the tablespace structure).
  • When the system starts, it loads the Max Row ID attribute into memory, adds 256 to it, and assigns the result to the global variable (because the global variable may have been larger than the Max Row ID attribute at the last shutdown).
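The three rules above can be sketched in a few lines of Python. The class name `RowIdAllocator` and its attributes are hypothetical, and `on_disk` merely stands in for the Max Row ID attribute in page 7; the sketch only illustrates why adding 256 at startup avoids handing out a row_id twice.

```python
class RowIdAllocator:
    """Toy model of the row_id rules described above (not InnoDB code)."""

    STEP = 256

    def __init__(self, max_row_id_on_disk=0):
        # Stand-in for the Max Row ID attribute stored in page 7.
        self.on_disk = max_row_id_on_disk
        # Startup rule: load the persisted value and add 256.
        self.next_row_id = max_row_id_on_disk + self.STEP

    def allocate(self):
        row_id = self.next_row_id
        if row_id % self.STEP == 0:
            # Multiple of 256: persist the current value.
            self.on_disk = row_id
        self.next_row_id += 1
        return row_id


first_run = RowIdAllocator(0)
ids_before_crash = [first_run.allocate() for _ in range(300)]

# Simulated restart: only the persisted value survives.
second_run = RowIdAllocator(first_run.on_disk)
first_id_after_restart = second_run.allocate()
```

Here the last row_id handed out before the crash is 555 while the persisted value is 512, so restarting at 512 + 256 = 768 skips some values but never reuses one.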

The Max Row ID attribute occupies 8 bytes of storage. When a transaction inserts a record into a table that contains a hidden row_id column, and the row_id assigned to that record is a multiple of 256, an 8-byte value is written at the corresponding offset of page number 7 in the system tablespace. This write actually happens in the Buffer Pool, so we need to record a redo log for the page modification so that, after a crash, the change made to the page by the committed transaction can be restored. In this case the modification is extremely simple: the redo log only needs to record that a few bytes were changed at a certain offset of a certain page, and what they were changed to. InnoDB calls this kind of extremely simple redo log a physical log, and defines several redo log types according to how much data is written to the page:

  • MLOG_1BYTE (type field value 1 in decimal): the redo log type for writing 1 byte at some offset of a page.
  • MLOG_2BYTE (type field value 2 in decimal): the redo log type for writing 2 bytes at some offset of a page.
  • MLOG_4BYTE (type field value 4 in decimal): the redo log type for writing 4 bytes at some offset of a page.
  • MLOG_8BYTE (type field value 8 in decimal): the redo log type for writing 8 bytes at some offset of a page.
  • MLOG_WRITE_STRING (type field value 30 in decimal): the redo log type for writing a string of data at some offset of a page.

The Max Row ID attribute mentioned above occupies 8 bytes of storage, so modifying this attribute in the page records a redo log of type MLOG_8BYTE. The structure of an MLOG_8BYTE redo log is as follows:

[Figure: structure of an MLOG_8BYTE redo log]
The redo log structures of the MLOG_1BYTE, MLOG_2BYTE, and MLOG_4BYTE types are similar to that of MLOG_8BYTE, except that the data part contains the corresponding number of bytes. The MLOG_WRITE_STRING type indicates that a string of data is written, but because the number of bytes written is not fixed, a len field has to be added to the log structure:

[Figure: structure of an MLOG_WRITE_STRING redo log, including the len field]

Tip:
As long as the len field of an MLOG_WRITE_STRING redo log is set to 1, 2, 4, or 8, it can replace the MLOG_1BYTE, MLOG_2BYTE, MLOG_4BYTE, and MLOG_8BYTE types respectively. So why design so many types? To save space: if the len field can be omitted, it is omitted, and every byte saved counts.
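The space argument in the tip can be checked with a small Python sketch. The type values come from the list above; the uncompressed field widths (1-byte type, 4-byte space ID, 4-byte page number, 2-byte offset, 2-byte len), the offset value, and the function names are assumptions for illustration, not InnoDB's actual on-disk encoding.

```python
import struct

MLOG_8BYTE = 8
MLOG_WRITE_STRING = 30

# Assumed header for this sketch: type, space ID, page number, offset.
HEADER = struct.Struct(">BIIH")


def encode_fixed(log_type, space_id, page_no, offset, data):
    """MLOG_1BYTE/2BYTE/4BYTE/8BYTE: the type itself implies the data length."""
    if log_type not in (1, 2, 4, 8) or len(data) != log_type:
        raise ValueError("data length must match the fixed-size type")
    return HEADER.pack(log_type, space_id, page_no, offset) + data


def encode_string(space_id, page_no, offset, data):
    """MLOG_WRITE_STRING: an explicit len field precedes the data."""
    header = HEADER.pack(MLOG_WRITE_STRING, space_id, page_no, offset)
    return header + struct.pack(">H", len(data)) + data


new_max_row_id = (512).to_bytes(8, "big")
made_up_offset = 38  # hypothetical position of Max Row ID inside page 7

as_8byte = encode_fixed(MLOG_8BYTE, 0, 7, made_up_offset, new_max_row_id)
as_string = encode_string(0, 7, made_up_offset, new_max_row_id)
```

Under these assumed widths, the MLOG_8BYTE form is 19 bytes and the MLOG_WRITE_STRING form is 21 bytes; the difference is exactly the len field.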

2.2 More complex redo log types

Sometimes executing a statement modifies many pages, including system data pages and user data pages (user data here means the B+ trees of the clustered index and the secondary indexes). Take an INSERT statement as an example: besides inserting data into B+ tree pages, it may also update system data such as Max Row ID. As users, though, we usually care more about the updates the statement makes to the B+ trees:

  • A table has as many B+ trees as it has indexes, and a single INSERT statement may update all of them.
  • For a given B+ tree, the statement may update leaf node pages, update internal node pages, or create new pages (when the leaf node that should hold the record does not have enough free space, the page is split and a directory entry record is added to an internal node page).

While a statement executes, every page modification made by the INSERT statement must be saved to the redo log. That is easy to say but harder to do. For example, when inserting a record into the clustered index, if the target leaf node has enough free space for the record, only that leaf node page needs to be updated. So is it enough to record a single MLOG_WRITE_STRING redo log saying what data was added at which offset of the page? That's too naive. Don't forget that besides the actual records, a data page also contains the File Header, Page Header, Page Directory, and other parts (explained in the chapter on data pages). So every time a record is inserted into a leaf node's data page, many other places are updated as well, such as:

  • Slot information in the Page Directory may be updated.
  • Various page statistics in the Page Header may change: the slot count in PAGE_N_DIR_SLOTS, the lowest address of unused space in PAGE_HEAP_TOP, the record count in PAGE_N_HEAP, and so on.
  • The records in a data page form a singly linked list in ascending order of the index columns. Every time a record is inserted, the next_record attribute in the record header of the previous record must be updated to maintain this list.
  • There are other places to update, which we won't go through here.

A simple schematic looks like this:

[Figure: a data page with the areas modified by a record insert drawn as bold blocks]
The point of all this is that inserting a record into a page changes a great many places. If we used the simple physical redo logs introduced above to record these modifications, there would be two options:

  • Solution 1: Record one redo log for each modification. That is, write as many physical redo logs as there are bold blocks in the figure above. The drawback is obvious: with so many modified places, the redo logs may take up more space than the entire page.
  • Solution 2: Treat all the data between the first modified byte and the last modified byte of the page as the data of a single physical redo log. As the figure shows, however, there is still a lot of unmodified data between the first and the last modified byte, and writing all of that unmodified data to the redo log would be a waste.
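A rough back-of-the-envelope comparison of the two solutions, sketched in Python. The list of modified (offset, length) ranges and the 13-byte per-record overhead are made-up numbers for illustration; they are not measurements of a real insert.

```python
# Assumed per-record overhead: type, space ID, page number, offset, len.
HEADER_BYTES = 13


def cost_one_record_per_change(changes):
    """Solution 1: one physical redo log for every modified range."""
    return sum(HEADER_BYTES + length for _, length in changes)


def cost_single_span(changes):
    """Solution 2: one physical redo log covering first to last modified byte."""
    first = min(offset for offset, _ in changes)
    last = max(offset + length for offset, length in changes)
    return HEADER_BYTES + (last - first)


# Hypothetical (offset, length) ranges touched by a single insert:
# a few Page Header fields, the new record, the previous record's
# next_record, and a Page Directory slot near the end of the page.
changes = [(40, 2), (44, 2), (46, 2), (120, 30), (110, 2), (16370, 2)]

changed_bytes = sum(length for _, length in changes)
solution_1 = cost_one_record_per_change(changes)
solution_2 = cost_single_span(changes)
```

With these made-up numbers, only 40 bytes actually change, yet Solution 1 logs 118 bytes (mostly headers) and Solution 2 logs 16345 bytes (mostly unmodified data), which is the waste both bullets describe.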

Because both ways of using physical redo logs to record the changes made to a page are wasteful, the InnoDB designers, in the spirit of frugality, introduced some new redo log types, such as:

  • MLOG_REC_INSERT (type field value 9 in decimal): the redo log type for inserting a record that uses a non-compact row format.

  • MLOG_COMP_REC_INSERT (type field value 38 in decimal): the redo log type for inserting a record that uses a compact row format.

    Tip:
    Redundant is an older row format that is not compact. Compact, Dynamic, and Compressed are newer row formats that are compact (they take up less storage space).

  • MLOG_COMP_PAGE_CREATE (type field value 58 in decimal): the redo log type for creating a page that stores records in a compact row format.

  • MLOG_COMP_REC_DELETE (type field value 42 in decimal): the redo log type for deleting a record that uses a compact row format.

  • MLOG_COMP_LIST_START_DELETE (type field value 44 in decimal): the redo log type for deleting a series of compact-row-format records in a page, starting from a given record.

  • MLOG_COMP_LIST_END_DELETE (type field value 43 in decimal): the counterpart of MLOG_COMP_LIST_START_DELETE, indicating that a series of records is deleted up to the record that the MLOG_COMP_LIST_END_DELETE redo log refers to.

    Tip:
    When we covered the InnoDB data page format, we emphasized that the records in a data page form a singly linked list ordered by index column value. Sometimes we need to delete all records whose index column values fall within a certain range. Writing one redo log per deleted record would be inefficient, so the MLOG_COMP_LIST_START_DELETE and MLOG_COMP_LIST_END_DELETE types were introduced, which can greatly reduce the number of redo logs.

  • MLOG_ZIP_PAGE_COMPRESS (type field value 51 in decimal): the redo log type for compressing a data page.

  • There are many more types, which we won't list here; they will be introduced when they are needed.

These types of redo logs have both a physical and a logical aspect:

  • At the physical level, these logs indicate which page in which tablespace has been modified.
  • At the logical level, when the system restarts after a crash, a page cannot be restored simply by writing certain data at a certain offset according to these logs. Instead, some predefined functions must be called, and only after these functions run is the page restored to the state it was in before the crash.

This may be a bit confusing, so let's use the MLOG_COMP_REC_INSERT type, the redo log for inserting a record in a compact row format, as an example of what the physical and logical levels mean. Here is the structure of this type of redo log (there are many fields, so they are listed vertically):

[Figure: structure of an MLOG_COMP_REC_INSERT redo log]
Several points about the MLOG_COMP_REC_INSERT redo log structure deserve attention:

  • As we said when covering indexes, in a data page, whether it belongs to a leaf node or a non-leaf node, records are sorted in ascending order of the index columns. For secondary indexes, when the index column values are equal, records are further sorted by primary key value. The n_uniques value in the figure is the number of fields needed to guarantee the uniqueness of a record, so that when a record is inserted it can be sorted by its first n_uniques fields. For a clustered index, n_uniques is the number of primary key columns; for secondary indexes, it is the number of index columns plus the number of primary key columns. Note that the values in a unique secondary index may be NULL, so for it the value is still the number of index columns plus the number of primary key columns.

  • field1_len ~ fieldn_len give the storage size of each field of the record. Note that the size of every field is written to the redo log, regardless of whether the field type is fixed-length (such as INT) or variable-length (such as VARCHAR(M)).

  • offset is the address within the page of the record that precedes this one. Why record the address of the previous record? Because every insert into a data page must update the record list maintained in the page. The record header of each record contains an attribute called next_record, so inserting a new record requires modifying the next_record attribute of the previous record.

  • A record consists of two parts, the extra information and the real data, and the total size of the two is the storage size of the record. The total size can be derived indirectly from the end_seg_len value. Why not store the total size directly? Because writing redo logs is a very frequent operation, InnoDB tries hard to reduce the space the redo log itself occupies and uses some roundabout algorithms to do so. The end_seg_len field exists to save redo log space.

  • The mismatch_index value also exists to reduce the size of the redo log; you can ignore it.

Clearly, an MLOG_COMP_REC_INSERT redo log does not record what PAGE_N_DIR_SLOTS, PAGE_HEAP_TOP, PAGE_N_HEAP, and so on were changed to. It only records the elements needed to insert a record into this page. When the system restarts after a crash, the server calls the function that inserts a record into a page, and the data in the redo log serves as the arguments of that call. After the function returns, the values of PAGE_N_DIR_SLOTS, PAGE_HEAP_TOP, PAGE_N_HEAP, and so on in the page are restored to what they were before the crash. This is what a logical log means.
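The idea of a logical log can be illustrated with a toy page in Python. The dictionary-based page, the `page_insert` function, and the way the two counters are updated are all simplifications invented for this sketch; the point is only that the redo log stores the arguments of the insert, and replaying the same function rebuilds the derived fields.

```python
import bisect
import copy


def new_page():
    # Toy page: sorted records plus two derived "Page Header" counters.
    return {"records": [], "PAGE_N_HEAP": 0, "PAGE_HEAP_TOP": 0}


def page_insert(page, key, body):
    """Insert a record and update the derived fields as a side effect."""
    bisect.insort(page["records"], (key, body))
    page["PAGE_N_HEAP"] += 1
    page["PAGE_HEAP_TOP"] += len(body)


on_disk = new_page()                 # stale copy that survives a crash
in_memory = copy.deepcopy(on_disk)   # Buffer Pool copy being modified
redo_log = []

# Normal execution: modify the in-memory page, log only the arguments.
for key, body in [(20, b"bbb"), (10, b"a")]:
    page_insert(in_memory, key, body)
    redo_log.append(("insert", key, body))

# Crash recovery: start from the stale page and call the same function
# again with the logged arguments.
recovered = copy.deepcopy(on_disk)
for _, key, body in redo_log:
    page_insert(recovered, key, body)
```

Nothing in `redo_log` mentions PAGE_N_HEAP or PAGE_HEAP_TOP, yet after the replay `recovered` matches `in_memory` field for field.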

2.3 Summary of redo log format

Although a lot has been said above about redo log formats, unless you are writing a tool that parses redo logs or developing your own redo log system, there is no need to study the format of every InnoDB redo log type in depth. The few formats above were introduced only so that you understand one thing: the redo log records all the modifications a transaction makes to the database during execution, so that after the system crashes and restarts, any modification made by the transaction can be restored.

Tip:
To save the space occupied by redo logs, the InnoDB designers may also compress some of the data in them. For example, the space ID and page number normally take 4 bytes each, but after compression they may take less. The compression algorithm itself is not covered here.

3. Mini-Transaction

3.1 Write redo logs in the form of groups

A statement may modify several pages during execution. For example, the INSERT statement mentioned earlier may modify the Max Row ID attribute in page number 7 of the system tablespace (it may also update other system pages, which we have not listed), and it also updates pages in the B+ trees of the clustered index and the secondary indexes. Since these changes all happen in the Buffer Pool, a corresponding redo log must be recorded after each page modification. The InnoDB designers divided the redo logs generated during statement execution into several indivisible groups, for example:

  • The redo log generated when updating the Max Row ID attribute is indivisible.
  • The redo logs generated when inserting a record into a page of the clustered index's B+ tree are indivisible.
  • The redo logs generated when inserting a record into a page of a secondary index's B+ tree are indivisible.
  • The redo logs generated by other page access operations are indivisible as well.

What does indivisible mean? Take inserting a record into the B+ tree of an index as an example. Before inserting the record, we need to locate the data page of the leaf node where the record belongs. Once that data page is located, there are two possible situations:

  • Situation 1: The data page has enough free space for the record to be inserted. This is the simple case: insert the record directly into the data page and record one redo log of type MLOG_COMP_REC_INSERT. We call this situation an optimistic insert. Suppose the B+ tree of an index looks like this:

    [Figure: B+ tree before the optimistic insert]
    Now we want to insert a record with key value 10, which clearly belongs in page b. Since page b has enough free space for one more record, the record is inserted directly into page b, like this:
    [Figure: B+ tree after the record with key value 10 is inserted into page b]

  • Situation 2: The data page does not have enough free space. As we said before, in this case a page split is needed: create a new leaf node, copy some records from the original data page to the new one, insert the record, link the new leaf node into the leaf node list, and finally add a directory entry record to the internal node that points to the newly created page. This process clearly modifies multiple pages, which means multiple redo logs are generated. We call this situation a pessimistic insert. Suppose the B+ tree of an index looks like this:

    [Figure: B+ tree before the pessimistic insert, with page b full]

    Now we want to insert a record with key value 10, which clearly belongs in page b. But as the figure shows, page b is already full and has no free space for the new record, so a page split is needed, like this:

    [Figure: B+ tree after page b is split]
    If page a, as an internal node, does not have enough free space for the additional directory entry record, then page a must be split as well, which means more pages are modified and more redo logs are generated. In addition, because a pessimistic insert has to allocate new data pages, some system pages must also be modified, for example the statistics of various segments and extents and of various linked lists (such as the FREE list and the FSP_FREE_FRAG list introduced in the tablespace chapter). All in all, 20 or 30 redo logs may need to be recorded.

    Tip:
    In fact, it is not only pessimistic inserts that generate many redo logs. Because of some other InnoDB features, an optimistic insert may also generate multiple redo logs (we won't go into those features here, as it would take too much space).

The InnoDB designers decided that inserting a record into the B+ tree of an index must be atomic; it cannot stop halfway. For example, during a pessimistic insert, the new page may have been allocated, the data copied, and the new record inserted into the page, but the directory entry record not yet added to the internal node. Such an insert is incomplete and leaves an incorrect B+ tree. The redo log exists to restore the pre-crash state when the system restarts after a crash. If only part of the redo logs of a pessimistic insert were recorded, the B+ tree of the index would be restored to an incorrect state on restart, which the InnoDB designers could not accept. They therefore stipulated that for operations that must be atomic, redo logs are recorded in groups, and that during crash recovery either all the redo logs in a group are applied or none of them are. How is this done? There are two cases:

  • Some operations that must be atomic generate multiple redo logs. For example, a pessimistic insert into the B+ tree of an index generates many redo logs. How are these redo logs marked as one group? The InnoDB designers used a simple trick: a redo log of a special type is appended after the last redo log in the group. The type is named MLOG_MULTI_REC_END, its type field value is 31 in decimal, and its structure is very simple, with only a type field:
    [Figure: structure of an MLOG_MULTI_REC_END redo log, consisting only of the type field]
    So the series of redo logs generated by an operation that must be atomic has to end with a redo log of type MLOG_MULTI_REC_END, like this:

    [Figure: a group of redo logs terminated by an MLOG_MULTI_REC_END redo log]

  • During crash recovery, a complete group of redo logs is considered parsed, and is applied, only once a redo log of type MLOG_MULTI_REC_END has been parsed. Otherwise, the redo logs parsed so far are discarded.

  • Some operations that must be atomic generate only one redo log. For example, updating the Max Row ID attribute generates only one redo log.

    Such a log could also be followed by a redo log of type MLOG_MULTI_REC_END, but the InnoDB designers are frugal and did not want to waste even that. Although there are dozens of redo log types, their number is well below 127, so 7 bits are enough to cover all redo log types, while the type field actually occupies 1 byte. The spare bit can therefore indicate that an operation that must be atomic generated only a single redo log, as shown in the schematic:

    [Figure: the 1-byte type field, with the first bit as a flag and the remaining 7 bits as the redo log type]

  • If the first bit of the type field is 1, the operation that must be atomic generated only a single redo log; otherwise, it generated a series of redo logs.
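The grouping rules above can be sketched as a small parser in Python. Representing each redo log as a `(type_byte, payload)` tuple and the function name `complete_groups` are assumptions for illustration; only the value 31 for MLOG_MULTI_REC_END and the use of the first bit of the type byte as a single-record flag come from the text.

```python
MLOG_MULTI_REC_END = 31
SINGLE_REC_FLAG = 0x80  # first (highest) bit of the 1-byte type field
TYPE_MASK = 0x7F        # remaining 7 bits hold the redo log type


def complete_groups(records):
    """Return only the groups that are safe to replay.

    records is a list of (type_byte, payload) tuples in log order.
    """
    groups, pending = [], []
    for type_byte, payload in records:
        log_type = type_byte & TYPE_MASK
        if type_byte & SINGLE_REC_FLAG:
            # A single record forms a complete group by itself.
            groups.append([(log_type, payload)])
        elif log_type == MLOG_MULTI_REC_END:
            # End marker: everything collected so far is one complete group.
            groups.append(pending)
            pending = []
        else:
            pending.append((log_type, payload))
    # Whatever is left in pending never saw its end marker: discard it.
    return groups


log = [
    (SINGLE_REC_FLAG | 8, b"max row id"),   # MLOG_8BYTE, single-record mtr
    (38, b"insert 1"),                      # MLOG_COMP_REC_INSERT
    (38, b"insert 2"),
    (MLOG_MULTI_REC_END, b""),
    (38, b"insert 3"),                      # crash before the end marker
]
groups = complete_groups(log)
```

In this example the last record is dropped because its group was never closed, so recovery replays two groups and ignores the incomplete third one.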

3.2 The concept of Mini-Transaction

A single atomic access to the underlying pages is called a Mini-Transaction, or mtr for short. For example, modifying the Max Row ID value once, as mentioned above, is one Mini-Transaction, and inserting a record into the B+ tree of an index is also one Mini-Transaction. From the description above, we also know that an mtr can contain a group of redo logs, and that during crash recovery this group is treated as an indivisible whole.

A transaction can contain several statements, each statement consists of several mtrs, and each mtr can contain several redo logs. Their relationship looks like this:

[Figure: a transaction contains statements, a statement contains mtrs, an mtr contains redo logs]

4. The writing process of redo log

4.1 redo log block

To better support crash recovery, the InnoDB designers store the redo logs generated by mtrs in pages of 512 bytes. To distinguish them from the pages in tablespaces mentioned earlier, we call a page used to store redo logs a block (pages and blocks mean roughly the same thing). A redo log block is shown in the schematic:

[Figure: a redo log block: log block header, log block body, log block trailer]
The actual redo logs are stored in the 496-byte log block body. The log block header and log block trailer in the figure hold management information. Let's look at what this management information is:

[Figure: attributes of the log block header and log block trailer]
The attributes of the log block header have the following meanings:

  • LOG_BLOCK_HDR_NO: Each block has a unique number greater than 0, and this attribute holds that number.

    The attribute is assigned when the block is first used and depends on the system lsn value at that time. The LOG_BLOCK_HDR_NO of a block is calculated with the following formula: ((lsn / 512) & 0x3FFFFFFFUL) + 1
    The 0x3FFFFFFFUL in this formula may be confusing, but its binary representation is friendlier:

    [Figure: binary representation of 0x3FFFFFFFUL]
    As the figure shows, the first 2 bits of the binary number are 0 and the last 30 bits are all 1. ANDing (&) a bit with 0 always gives 0, and ANDing a bit with 1 gives the original bit. So ANDing a value with 0x3FFFFFFFUL sets its first 2 bits to 0, and the result is always less than or equal to 0x3FFFFFFFUL. This means that no matter how large lsn is, ((lsn / 512) & 0x3FFFFFFFUL) is always between 0 and 0x3FFFFFFFUL, and after adding 1 it is between 1 and 0x40000000UL. The value 0x40000000UL should look familiar: it equals 2^30, that is, 1G. In other words, the system can generate at most 1G distinct LOG_BLOCK_HDR_NO values. The InnoDB designers stipulated that the total size of all files in the redo log file group must not exceed 512GB, and one block is 512 bytes, so the redo log file group contains at most 1G blocks, and 1G distinct numbers are enough.

    In addition, the first bit of the LOG_BLOCK_HDR_NO value is special and is called the flush bit. If it is 1, this block is the first block flushed in an operation that flushes blocks from the log buffer to disk.

  • LOG_BLOCK_HDR_DATA_LEN: The number of bytes already used in the block. The initial value is 12 (because the log block body starts at byte 12). As more redo logs are written to the block, the value grows. Once the log block body is completely full, the value is set to 512.

  • LOG_BLOCK_FIRST_REC_GROUP: A redo log is also called a redo log record, and an mtr may produce multiple redo log records, which together are called a redo log record group. LOG_BLOCK_FIRST_REC_GROUP is the offset within the block of the first redo log record group generated by an mtr (that is, the offset of the first redo log generated by the first mtr in this block).

  • LOG_BLOCK_CHECKPOINT_NO: The checkpoint sequence number. Checkpoints are a focus of later content, so there is no need to understand this attribute yet.

The attribute of the log block trailer has the following meaning:

  • LOG_BLOCK_CHECKSUM: The checksum of the block, used for integrity verification. We don't need to care about it for now.
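The block numbering formula and the block layout above can be checked with a short Python sketch. The sizes 512, 12, and 496 and the formula come from the text; the 4-byte trailer follows from them. The individual header field widths (4, 2, 2, and 4 bytes) and the helper names are assumptions chosen so that the header adds up to 12 bytes.

```python
import struct

BLOCK_SIZE = 512
HEADER_SIZE = 12
BODY_SIZE = 496
TRAILER_SIZE = BLOCK_SIZE - HEADER_SIZE - BODY_SIZE  # what remains for the trailer

# Assumed widths: LOG_BLOCK_HDR_NO (4), LOG_BLOCK_HDR_DATA_LEN (2),
# LOG_BLOCK_FIRST_REC_GROUP (2), LOG_BLOCK_CHECKPOINT_NO (4).
HEADER = struct.Struct(">IHHI")


def block_hdr_no(lsn):
    """((lsn / 512) & 0x3FFFFFFF) + 1, using integer division."""
    return ((lsn // BLOCK_SIZE) & 0x3FFFFFFF) + 1


def new_block_header(lsn, checkpoint_no=0):
    """Header of a freshly used block: data_len starts at 12, no record group yet."""
    return HEADER.pack(block_hdr_no(lsn), HEADER_SIZE, 0, checkpoint_no)


# 512GB of log files divided into 512-byte blocks gives 1G blocks.
max_blocks = (512 * 2**30) // BLOCK_SIZE
```

Because of the mask, the block number wraps around: an lsn that is exactly 1G blocks further along maps back to block number 1, which is harmless as long as the log file group never holds more than 1G blocks.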

4.2 redo log buffer

As we said before, the InnoDB designers introduced the Buffer Pool to deal with slow disks. In the same way, redo logs are not written directly to disk. When the server starts, it requests from the operating system a large contiguous memory area called the redo log buffer, or log buffer for short. This memory area is divided into a number of consecutive redo log blocks, like this:

[Figure: the log buffer as a sequence of redo log blocks]
The size of the log buffer can be specified with the innodb_log_buffer_size startup parameter, whose default value is 16MB.

mysql> show variables like 'innodb_log_buffer_size';
+------------------------+----------+
| Variable_name          | Value    |
+------------------------+----------+
| innodb_log_buffer_size | 16777216 |
+------------------------+----------+
1 row in set (0.01 sec)

Large log buffers enable large transactions to run without writing the log to disk before the transaction commits. Therefore, if you have transactions that update, insert, or delete many rows, increasing the log buffer can save disk I/O.

mysql> set persist innodb_log_buffer_size =33554432;
Query OK, 0 rows affected (0.04 sec)

4.3 Redo log is written to log buffer

Redo logs are written to the log buffer sequentially: the earlier block is filled first, and when its free space is used up, writing moves on to the next block. The first question when writing a redo log to the log buffer is which block, and which offset within that block, it should be written to. InnoDB therefore provides a global variable called buf_free, which indicates where in the log buffer subsequent redo logs should be written, as shown in the figure:

insert image description here
We said earlier that an mtr may generate several redo logs while it runs, and that these redo logs form an indivisible group. So a redo log is not inserted into the log buffer as soon as it is generated. Instead, the logs generated by each mtr are first stored temporarily in one place, and when the mtr ends, all the redo logs it generated are copied into the log buffer together. Let's assume there are two transactions named T1 and T2, each consisting of 2 mtrs, and name these mtrs:

  • The two mtrs of transaction T1 are called mtr_T1_1 and mtr_T1_2
  • The two mtrs of transaction T2 are called mtr_T2_1 and mtr_T2_2

Each mtr generates a group of redo logs. A schematic diagram of the logs generated by these mtrs:

insert image description here
Different transactions may run concurrently, so the mtrs of T1 and T2 may execute alternately. Whenever an mtr finishes, the group of redo logs it generated has to be copied into the log buffer, which means that mtrs of different transactions may be written to the log buffer alternately. Let's draw a schematic diagram (to keep it tidy, all the redo logs generated by one mtr are drawn as a single unit):

insert image description here
The schematic diagram shows that the groups of redo logs generated by different mtrs may occupy different amounts of space. Some mtrs generate very few redo logs, while others generate a great many.
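
The rule that an mtr keeps its redo logs to itself and copies them into the log buffer as one group when it ends can be mimicked with a toy model. The classes LogBuffer and Mtr below are hypothetical names introduced for this sketch. The real log buffer is a byte array divided into 512-byte blocks, which is simplified here to a Python list.

```python
class LogBuffer:
    """Toy log buffer: an ordered list of redo records (simplified model)."""

    def __init__(self):
        self.records = []

    def append_group(self, group):
        # The whole group of one mtr is copied in a single step,
        # so records of different mtrs never interleave inside a group.
        self.records.extend(group)


class Mtr:
    """Toy mini-transaction that collects its redo records privately."""

    def __init__(self, name, log_buffer):
        self.name = name
        self.log_buffer = log_buffer
        self.pending = []

    def log(self, description):
        self.pending.append((self.name, description))

    def commit(self):
        self.log_buffer.append_group(self.pending)
        self.pending = []


buf = LogBuffer()
m1, m2 = Mtr("mtr_T1_1", buf), Mtr("mtr_T2_1", buf)
m1.log("change 1")
m2.log("change 1")   # T2 runs concurrently with T1
m1.log("change 2")
m2.commit()          # mtr_T2_1 happens to finish first
m1.commit()
print([name for name, _ in buf.records])
# ['mtr_T2_1', 'mtr_T1_1', 'mtr_T1_1']
```

Although the two mtrs generated their records alternately, each group ends up contiguous in the buffer, in the order in which the mtrs finished.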

5. Redo log

5.1 Timing of flushing redo logs

We said earlier that the group of redo logs generated by an mtr is copied into the log buffer when the mtr ends, but these logs cannot stay in memory forever. In some situations they are flushed to disk, for example:

  • When the log buffer runs short of space: the size of the log buffer is limited (specified by the system variable innodb_log_buffer_size), so if logs keep being added to it, it will soon fill up. InnoDB considers that once the redo logs written to the log buffer occupy about half of its total capacity, they need to be flushed to disk.
  • When a transaction commits: we said earlier that the main reason for using redo logs is that they take up little space and are written sequentially. When a transaction commits, the modified Buffer Pool pages need not be flushed to disk, but to guarantee durability, the redo logs corresponding to those modifications must be flushed to disk.
  • A background thread flushes continuously:
    There is a background thread called Master Thread that flushes the redo logs in the log buffer to disk about once per second.
  • When the server is shut down gracefully
  • When doing a so-called checkpoint (we have not introduced the concept of a checkpoint yet; it is covered in detail later)
  • Some other situations...

5.2 redo log file group

By default there are two files named ib_logfile0 and ib_logfile1 in the MySQL data directory (use SHOW VARIABLES LIKE 'datadir' to find it), and the redo logs in the log buffer are flushed to these two files by default. If the defaults do not suit us, they can be adjusted with the following startup parameters:

  • innodb_log_group_home_dir: Specifies the directory where the redo log files are located. The default value is the current data directory.
  • innodb_log_file_size: Specifies the size of each redo log file. The default value is 48MB.
  • innodb_log_files_in_group: Specifies the number of redo log files. The default value is 2 and the maximum value is 100.

As the description above shows, there is not just one redo log file on disk but a whole log file group. The files are named ib_logfile[number] (the number can be 0, 1, 2, ...). Redo logs are written to the group starting with ib_logfile0. When ib_logfile0 is full, writing continues in ib_logfile1. When ib_logfile1 is full, writing continues in ib_logfile2, and so on. What happens when the last file is full? Writing goes back to ib_logfile0, so the whole process looks like the figure below:

insert image description here
The total size of the redo log files is: innodb_log_file_size × innodb_log_files_in_group

Tip:
If redo logs are written to the file group in a circular way, won't the head catch up with the tail, so that redo logs written later overwrite those written earlier? Of course that can happen! This is why InnoDB introduced the concept of the checkpoint, which we will explain in detail later.

5.3 redo log file format

We said earlier that the log buffer is essentially a piece of contiguous memory divided into a number of 512-byte blocks. Flushing the redo logs in the log buffer to disk essentially means writing images of these blocks into the log files, so a redo log file is also made up of 512-byte blocks. Every file in the redo log file group has the same size and format, and consists of two parts:

  • The first 2048 bytes, that is, the first 4 blocks, store management information
  • Everything from byte 2048 onwards stores the block images from the log buffer

So the circular use of redo log files mentioned earlier actually starts from byte 2048 of each log file. The schematic diagram looks like this:

insert image description here
We already covered the format of an ordinary block when discussing the log buffer, namely its three parts: log block header, log block body, and log block trailer, so we will not repeat that here. What needs introducing is the format of the first 2048 bytes of each redo log file, that is, the first 4 special blocks. Let's look at the picture first:

insert image description here
As can be seen from the figure, the four blocks are:

log file header: describes some overall properties of the redo log file. Its structure looks like this:

insert image description here

The specific interpretation of each attribute is as follows:

Attribute name | Length (bytes) | Description
LOG_HEADER_FORMAT | 4 | Version of the redo log; the value is always 1
LOG_HEADER_PAD1 | 4 | Padding bytes with no practical meaning; ignore them
LOG_HEADER_START_LSN | 8 | The LSN value at the start of this redo log file, that is, the LSN corresponding to file offset 2048 (LSN is explained later; ignore it for now if it is unclear)
LOG_HEADER_CREATOR | 32 | A string identifying the creator of this redo log file. During normal operation it is the MySQL version; for redo log files created with the mysqlbackup command it is "ibbackup" plus the creation time
LOG_BLOCK_CHECKSUM | 4 | Checksum of this block; every block has one, and we can ignore it

Tip:
InnoDB has changed the block format of the redo log many times. If the attributes above differ from those in other books you have read, don't panic; this is normal. The LSN value is introduced later, so there is no need to worry about what it is yet.

checkpoint1: records some attributes of the checkpoint. Its structure looks like this:

insert image description here
The specific interpretation of each attribute is as follows:

Attribute name | Length (bytes) | Description
LOG_CHECKPOINT_NO | 8 | The server's checkpoint number; it is increased by 1 every time a checkpoint is done
LOG_CHECKPOINT_LSN | 8 | The LSN value at the end of the server's checkpoint; crash recovery starts from this value
LOG_CHECKPOINT_OFFSET | 8 | The offset, within the redo log file group, of the LSN value in the previous attribute
LOG_CHECKPOINT_LOG_BUF_SIZE | 8 | The size of the log buffer when the server performed the checkpoint
LOG_BLOCK_CHECKSUM | 4 | Checksum of this block; every block has one, and we can ignore it

Tip:
It is normal not to understand the explanations above about checkpoint and LSN attributes yet. The point is only to become familiar with the names; we will cover them in detail later.

The third block: unused; ignore it.

checkpoint2: has the same structure as checkpoint1.

6. Log Sequence Number

From the moment the system starts running, pages are constantly being modified, which means redo logs are constantly being generated. The amount of redo logs keeps growing, just like a person's age: it increases from birth and can never go down. To record how many redo logs have been written, InnoDB uses a global variable called Log Sequence Number, or lsn for short. Unlike a person, whose age starts at 0, InnoDB stipulates that the initial lsn value is 8704 (that is, before a single redo log has been written, lsn is already 8704).

We know that redo logs are not written to the log buffer one at a time, but in units of the group of redo logs generated by one mtr, and that the log content is written into the log block body. However, the growth of lsn is calculated from the amount of log actually written plus the log block headers and log block trailers that are occupied along the way. Let's look at an example:

  • When the system starts for the first time and initializes the log buffer, buf_free (the variable that marks where the next redo log should be written in the log buffer) points to offset 12 of the first block (the size of the log block header), and lsn grows by 12 accordingly:

    insert image description here

  • If the group of redo logs generated by an mtr is relatively small, that is, the remaining free space of the block being written can hold all the logs submitted by the mtr, lsn grows by exactly the number of bytes of redo logs the mtr generated, like this:

insert image description here

  • We assume that mtr_1 in the figure above generates 200 bytes of redo logs, so lsn grows by 200, from 8716 to 8916.

  • If the group of redo logs generated by an mtr is relatively large, that is, the remaining free space of the block being written cannot hold all the logs submitted by the mtr, lsn grows by the number of bytes of redo logs the mtr generated plus the bytes of the additional log block headers and log block trailers that are occupied, like this:

    insert image description here

  • We assume that mtr_2 in the figure above generates 1000 bytes of redo logs. To write them into the log buffer, two more blocks have to be allocated, so lsn grows from 8916 by 1000 + 12×2 + 4×2 = 1032, to 9948.

    Tip:
    Why is the initial lsn value 8704? I don't really know; that is simply what was stipulated. You could equally stipulate that a person is one year old at birth, as long as age keeps growing as time goes by.

As the description above shows, every group of redo logs generated by an mtr has a unique LSN value corresponding to it, and the smaller the LSN value, the earlier the redo logs were generated.
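
The arithmetic in this example can be checked with a small Python sketch. It assumes that lsn % 512 gives the position inside the current block (this works out because the initial value 8704 is a multiple of 512), and that the 4-byte trailer and the next 12-byte header are counted when more data still has to be written after a block body fills up. The function name advance_lsn is made up for this illustration and is not an InnoDB function.

```python
BLOCK_SIZE = 512
HEADER_SIZE = 12                       # log block header
TRAILER_SIZE = 4                       # log block trailer
BODY_END = BLOCK_SIZE - TRAILER_SIZE   # first byte after the block body


def advance_lsn(lsn: int, nbytes: int) -> int:
    """Return the lsn after an mtr writes `nbytes` of redo log at `lsn`."""
    remaining = nbytes
    while remaining > 0:
        offset = lsn % BLOCK_SIZE
        if offset < HEADER_SIZE:
            lsn += HEADER_SIZE - offset          # skip the block header
            continue
        free = BODY_END - offset
        if free == 0:
            lsn += TRAILER_SIZE + HEADER_SIZE    # move on to the next block
            continue
        chunk = min(free, remaining)
        lsn += chunk
        remaining -= chunk
    return lsn


print(advance_lsn(8716, 200))    # 8916  (mtr_1)
print(advance_lsn(8916, 1000))   # 9948  (mtr_2: 1000 + 12*2 + 4*2 = 1032)
```

For mtr_2 the loop crosses two block boundaries, which is where the extra 2 × (12 + 4) = 32 bytes come from.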

6.1 flushed_to_disk_lsn

Redo logs are first written to the log buffer and later flushed to the redo log files on disk. InnoDB therefore has a global variable called buf_next_to_write that marks which logs in the log buffer have already been flushed to disk. The picture looks like this:

insert image description here
We said earlier that lsn indicates the amount of redo logs written in the current system, including logs that have been written to the log buffer but not yet flushed to disk. Correspondingly, InnoDB has a global variable that represents the amount of redo logs flushed to disk, called flushed_to_disk_lsn. When the system starts for the first time, its value is the same as the initial lsn value, 8704. As the system runs, redo logs are continuously written to the log buffer but not immediately flushed to disk, so the gap between lsn and flushed_to_disk_lsn widens. Let's demonstrate:

  • After the system starts for the first time, the redo logs generated by three mtrs, mtr_1, mtr_2, and mtr_3, are written to the log buffer. Assume the lsn values at the start and end of these three mtrs are:

    • mtr_1: 8716 ~ 8916
    • mtr_2: 8916 ~ 9948
    • mtr_3: 9948 ~ 10000

    At this point lsn has grown to 10000, but because nothing has been flushed yet, flushed_to_disk_lsn is still 8704, as shown in the figure:

    insert image description here
    Then the blocks in the log buffer are flushed to the redo log file. Assuming the logs of mtr_1 and mtr_2 are flushed to disk, flushed_to_disk_lsn should grow by the amount of log written by mtr_1 and mtr_2, so its value increases to 9948, as shown in the figure:

    insert image description here

To sum up: when new redo logs are written to the log buffer, lsn grows first while flushed_to_disk_lsn stays unchanged. Then, as logs in the log buffer are flushed to disk, flushed_to_disk_lsn grows as well. If the two values are equal, all redo logs in the log buffer have been flushed to disk.

Tip:
When an application writes a file, the data actually goes to the operating system's buffer first. If a write must not return until the operating system confirms that the data is on disk, the fsync function provided by the operating system has to be called. In fact, flushed_to_disk_lsn only grows after the system has executed fsync. When logs in the log buffer have merely been written to the operating system buffer without an explicit flush to disk, a different value called write_lsn grows instead. For ease of understanding, however, we do not distinguish between flushed_to_disk_lsn and write_lsn in this discussion.
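
The relationship between lsn, write_lsn, and flushed_to_disk_lsn can be modelled with three counters. This is a toy model: the class RedoLogCounters and its method names are invented for this sketch, and only the three variable names and the numbers come from the example above.

```python
class RedoLogCounters:
    """Toy model of the three counters discussed above."""

    def __init__(self, initial_lsn: int = 8704):
        self.lsn = initial_lsn                  # written to the log buffer
        self.write_lsn = initial_lsn            # handed to the OS buffer
        self.flushed_to_disk_lsn = initial_lsn  # confirmed on disk (fsync)

    def append(self, new_lsn: int) -> None:
        """An mtr copied its redo logs into the log buffer."""
        self.lsn = new_lsn

    def write_to_os(self, up_to_lsn: int) -> None:
        """Logs up to `up_to_lsn` were written to the OS buffer, no fsync yet."""
        self.write_lsn = min(up_to_lsn, self.lsn)

    def fsync(self) -> None:
        """Everything handed to the OS is now durable."""
        self.flushed_to_disk_lsn = self.write_lsn

    def all_flushed(self) -> bool:
        return self.lsn == self.flushed_to_disk_lsn


counters = RedoLogCounters()
counters.append(10000)       # mtr_1, mtr_2 and mtr_3 are in the log buffer
counters.write_to_os(9948)   # mtr_1 and mtr_2 reach the OS buffer
print(counters.write_lsn, counters.flushed_to_disk_lsn)   # 9948 8704
counters.fsync()
print(counters.flushed_to_disk_lsn, counters.all_flushed())  # 9948 False
```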

6.2 Correspondence between lsn value and redo log file offset

Because lsn represents the total amount of redo logs written by the system, lsn grows by as much as the redo logs an mtr generates (sometimes plus the sizes of log block headers and log block trailers). So once the logs generated by an mtr have been written to disk, it is easy to compute the offset in the redo log file group that corresponds to a given lsn value, as shown in the figure:

insert image description here
The initial LSN value 8704 corresponds to file offset 2048. After that, lsn grows by as many bytes as each mtr writes to disk.
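
One way to picture this mapping is the sketch below. It assumes that the log started at lsn 8704 at offset 2048 of ib_logfile0, that the first 2048 bytes of every file are skipped, and that writing wraps around after the last file. The function name lsn_to_file_offset is hypothetical. The real server works from a known (lsn, offset) pair such as the one stored with the checkpoint rather than from the initial value, so treat this purely as an illustration of the arithmetic.

```python
LOG_FILE_HDR_SIZE = 2048   # management information at the start of each file
INITIAL_LSN = 8704


def lsn_to_file_offset(lsn: int, file_size: int, n_files: int):
    """Return (file index, byte offset inside that file) for `lsn`."""
    usable = file_size - LOG_FILE_HDR_SIZE    # log bytes stored per file
    capacity = usable * n_files               # log bytes stored per group
    pos = (lsn - INITIAL_LSN) % capacity      # circular use of the group
    return pos // usable, LOG_FILE_HDR_SIZE + pos % usable


FILE_SIZE = 48 * 1024 * 1024   # default innodb_log_file_size
N_FILES = 2                    # default innodb_log_files_in_group

print(lsn_to_file_offset(8704, FILE_SIZE, N_FILES))   # (0, 2048)
print(lsn_to_file_offset(8916, FILE_SIZE, N_FILES))   # (0, 2260)
```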

6.3 LSN in the flush list

We know that an mtr represents one atomic access to the underlying pages, that it may generate a group of indivisible redo logs along the way, and that this group of redo logs is written to the log buffer when the mtr ends. There is one more very important thing an mtr does when it ends: it adds the pages that may have been modified during its execution to the flush list of the Buffer Pool. In case you have forgotten what the flush list is, here is the picture again:

insert image description here
When a page cached in the Buffer Pool is modified for the first time, its control block is inserted at the head of the flush list. When the page is modified again later, it is not inserted again because it is already in the flush list. In other words, dirty pages in the flush list are sorted by the time of their first modification, from latest to earliest. During this process, two attributes recording when the page was modified are stored in the page's control block:

  • oldest_modification: When a page is modified for the first time after being loaded into the Buffer Pool, the lsn value at the start of the mtr that modifies it is written to this attribute.
  • newest_modification: Every time a page is modified, the lsn value at the end of the mtr that modifies it is written to this attribute. That is, this attribute holds the system lsn value after the page's most recent modification.

Let's continue with the example used above for flushed_to_disk_lsn:

  • Assume that page a is modified while mtr_1 executes. When mtr_1 finishes, the control block of page a is added to the head of the flush list. The lsn at the start of mtr_1, 8716, is written to the oldest_modification attribute of page a's control block, and the lsn at the end of mtr_1, 8916, is written to its newest_modification attribute. The picture looks like this (to keep it tidy, oldest_modification is abbreviated to o_m and newest_modification to n_m):

    insert image description here

  • Next, assume that page b and page c are modified while mtr_2 executes. When mtr_2 finishes, the control blocks of page b and page c are added to the head of the flush list. The lsn at the start of mtr_2, 8916, is written to the oldest_modification attribute of both control blocks, and the lsn at the end of mtr_2, 9948, is written to their newest_modification attribute. The picture looks like this:

    insert image description here

  • The figure shows that every new node is inserted at the head of the flush list. That is, dirty pages nearer the front of the flush list were modified later, and dirty pages nearer the back were modified earlier.

  • Then assume that page b and page d are modified while mtr_3 executes. Page b has been modified before, so its control block is already in the flush list. When mtr_3 finishes, only the control block of page d needs to be added to the head of the flush list. The lsn at the start of mtr_3, 9948, is written to the oldest_modification attribute of page d's control block, and the lsn at the end of mtr_3, 10000, is written to its newest_modification attribute. In addition, because page b was modified again during mtr_3, the newest_modification value in page b's control block must be updated to 10000. The picture looks like this:

    insert image description here

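To summarize: dirty pages in the flush list are ordered by the time their modification occurred, that is, by the LSN value in oldest_modification. A page that is updated several times is not inserted into the flush list again, but its newest_modification attribute is updated.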
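
The bookkeeping described in this section can be sketched as follows. The classes ControlBlock and FlushList and the method record_modification are hypothetical names for this illustration. The real flush list is a linked list of Buffer Pool control blocks, modelled here with a deque whose left end is the head.

```python
from collections import deque


class ControlBlock:
    def __init__(self, page, oldest_modification, newest_modification):
        self.page = page
        self.oldest_modification = oldest_modification
        self.newest_modification = newest_modification


class FlushList:
    def __init__(self):
        self.blocks = deque()   # index 0 is the head of the list
        self.by_page = {}

    def record_modification(self, page, mtr_start_lsn, mtr_end_lsn):
        block = self.by_page.get(page)
        if block is None:
            # First modification: insert at the head of the flush list.
            block = ControlBlock(page, mtr_start_lsn, mtr_end_lsn)
            self.by_page[page] = block
            self.blocks.appendleft(block)
        else:
            # Already in the list: only newest_modification changes.
            block.newest_modification = mtr_end_lsn


flush_list = FlushList()
flush_list.record_modification("a", 8716, 8916)    # mtr_1
flush_list.record_modification("b", 8916, 9948)    # mtr_2
flush_list.record_modification("c", 8916, 9948)    # mtr_2
flush_list.record_modification("b", 9948, 10000)   # mtr_3
flush_list.record_modification("d", 9948, 10000)   # mtr_3

page_b = flush_list.by_page["b"]
print(flush_list.blocks[0].page, flush_list.blocks[-1].page)          # d a
print(page_b.oldest_modification, page_b.newest_modification)         # 8916 10000
```

The head is the most recently first-modified page (page d), the tail is the earliest one (page a), and page b appears only once even though two mtrs touched it.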

6.4 checkpoint

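Unfortunately, the capacity of the redo log file group is limited, so we have to use the files in the group circularly, which means the newest redo logs can catch up with the oldest ones. We should then remember that redo logs exist only to restore dirty pages after a system crash. If the corresponding dirty page has already been flushed to disk, then even if the system crashes right now, that redo log is not needed to restore the page after restart. It no longer has to be kept, and the disk space it occupies can be reused by later redo logs. In other words, whether the disk space occupied by a redo log can be overwritten depends on whether its corresponding dirty page has been flushed to disk. Let's return to the running example: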

insert image description here

As shown in the figure, although the redo logs generated by mtr_1 and mtr_2 have been written to disk, the dirty pages they modified still remain in the Buffer Pool, so the disk space occupied by their redo logs cannot be overwritten. Later, as the system runs, if page a is flushed to disk, its control block is removed from the flush list, like this:

insert image description here
The redo logs generated by mtr_1 are now useless, and the disk space they occupy can be overwritten. InnoDB uses a global variable called checkpoint_lsn to represent the total amount of redo logs that can be overwritten in the current system. Its initial value is also 8704.

For example, now that page a has been flushed to disk, the redo logs generated by mtr_1 can be overwritten, so we can perform an operation that increases checkpoint_lsn. We call this process doing a checkpoint. Doing a checkpoint can be divided into two steps:

  • Step 1: Calculate the maximum lsn value of the redo logs that can be overwritten in the current system

    A redo log can be overwritten once its corresponding dirty page has been flushed to disk. So we only need to find the oldest_modification value of the earliest-modified dirty page in the current system: all redo logs generated while the system lsn was below that value can be overwritten. We assign that dirty page's oldest_modification to checkpoint_lsn.

    For example, page a has now been flushed to disk, so the tail node of the flush list, page c, is the earliest-modified dirty page in the system. Its oldest_modification value is 8916, so we assign 8916 to checkpoint_lsn (that is, redo logs whose lsn value is less than 8916 can be overwritten).

  • Step 2: Write checkpoint_lsn, the corresponding redo log file group offset, and the number of this checkpoint into the management information of the log file (that is, checkpoint1 or checkpoint2).

    InnoDB maintains a variable called checkpoint_no that counts how many checkpoints the system has done so far; it is incremented by 1 each time a checkpoint is done. We said earlier that it is easy to compute the redo log file group offset for a given lsn value, so we can compute checkpoint_offset, the offset corresponding to checkpoint_lsn, and then write these three values into the management information of the redo log file group.

    We said that every redo log file has 2048 bytes of management information, but the checkpoint information above is only written into the management information of the first log file in the group. Is it stored in checkpoint1 or in checkpoint2? InnoDB specifies that when checkpoint_no is even it is written to checkpoint1, and when it is odd it is written to checkpoint2.

After the checkpoint information has been recorded, the relationship between the various lsn values in the redo log file group looks like this:

insert image description here
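
The two checkpoint steps can be condensed into a short sketch. CheckpointState and do_checkpoint are hypothetical names. Two details are assumptions of this sketch rather than statements from the text: the number stored with the checkpoint decides the slot and the counter is incremented afterwards, and when no dirty pages are left, checkpoint_lsn is set to the current lsn.

```python
INITIAL_LSN = 8704


class CheckpointState:
    def __init__(self):
        self.checkpoint_lsn = INITIAL_LSN
        self.checkpoint_no = 0

    def do_checkpoint(self, dirty_oldest_modifications, current_lsn):
        # Step 1: the smallest oldest_modification among the dirty pages
        # bounds the redo logs that may be overwritten.
        if dirty_oldest_modifications:
            self.checkpoint_lsn = min(dirty_oldest_modifications)
        else:
            self.checkpoint_lsn = current_lsn   # assumption: nothing is dirty

        # Step 2: an even checkpoint_no goes to checkpoint1, an odd one
        # to checkpoint2 (writing to the log file itself is omitted here).
        slot = "checkpoint1" if self.checkpoint_no % 2 == 0 else "checkpoint2"
        written = (self.checkpoint_no, self.checkpoint_lsn, slot)
        self.checkpoint_no += 1
        return written


state = CheckpointState()
# Page a has been flushed; pages b, c (8916) and d (9948) are still dirty.
first = state.do_checkpoint([8916, 8916, 9948], current_lsn=10000)
# Later pages b and c are flushed too; only page d remains dirty.
second = state.do_checkpoint([9948], current_lsn=10000)
print(first)    # (0, 8916, 'checkpoint1')
print(second)   # (1, 9948, 'checkpoint2')
```

Alternating between the two slots means that one complete checkpoint record always survives, even if the server crashes while the other is being written.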

6.5 Batch flush dirty pages from the flush list

As we said when introducing the Buffer Pool, dirty pages on the LRU list and the flush list are normally flushed by background threads, mainly because flushing is slow and should not hold up user threads that are handling requests. However, if the system modifies pages very frequently, redo logs are written very frequently and the system lsn grows quickly. If the background threads cannot flush dirty pages fast enough, the system cannot do a checkpoint in time, and user threads may have to synchronously flush the earliest-modified dirty pages (those with the smallest oldest_modification) from the flush list to disk. The redo logs for those pages then become unnecessary, and a checkpoint can be done.

6.6 View various LSN values in the system

We can use the SHOW ENGINE INNODB STATUS command to view the various LSN values of the InnoDB storage engine in the current system, for example:

mysql> SHOW ENGINE INNODB STATUS\G
(... many earlier status lines omitted)
LOG
---
Log sequence number          619362521
Log buffer assigned up to    619362521
Log buffer completed up to   619362521
Log written up to            619362521
Log flushed up to            619362521
Added dirty pages up to      619362521
Pages flushed up to          619362521
Last checkpoint at           619362521
Log minimum file id is       176
Log maximum file id is       189
80457 log i/o's done, 0.00 log i/o's/second
(... many later status lines omitted)

Where:

  • Log sequence number: The lsn value of the system, that is, the amount of redo logs written by the current system, including logs that are still in the log buffer.
  • Log flushed up to: The value of flushed_to_disk_lsn, that is, the amount of redo logs that the current system has written to disk.
  • Pages flushed up to: The oldest_modification value of the earliest-modified page in the flush list.
  • Last checkpoint at: The checkpoint_lsn value of the current system.
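
If you want to compare these values programmatically, the four lines explained above can be pulled out of the status text with a regular expression. The sketch below works on a pasted copy of the output shown earlier. The helper name parse_lsn_values and the variable names unflushed and checkpoint_age are made up for this example.

```python
import re

STATUS_TEXT = """\
Log sequence number          619362521
Log flushed up to            619362521
Pages flushed up to          619362521
Last checkpoint at           619362521
"""

LABELS = (
    "Log sequence number",
    "Log flushed up to",
    "Pages flushed up to",
    "Last checkpoint at",
)


def parse_lsn_values(text: str) -> dict:
    """Extract the LSN values for the labels explained above."""
    values = {}
    for label in LABELS:
        pattern = rf"^{re.escape(label)}\s+(\d+)\s*$"
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            values[label] = int(match.group(1))
    return values


values = parse_lsn_values(STATUS_TEXT)
# Redo still waiting in the log buffer (not yet flushed to disk):
unflushed = values["Log sequence number"] - values["Log flushed up to"]
# Redo that must be kept because it is newer than the last checkpoint:
checkpoint_age = values["Log sequence number"] - values["Last checkpoint at"]
print(unflushed, checkpoint_age)   # 0 0
```

In the sample output all values are equal, which means the server was idle: everything had been flushed and checkpointed.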

7. Usage of innodb_flush_log_at_trx_commit

We said earlier that to guarantee the durability of a transaction, the user thread must flush all redo logs generated during the transaction to disk when the transaction commits. This requirement is strict and clearly reduces database performance. If you do not need such strong durability guarantees, you can change the value of a system variable called innodb_flush_log_at_trx_commit, which has 3 possible values:

  • 0: Redo logs are not synchronized to disk immediately when a transaction commits; the task is left to the background thread. This clearly speeds up request processing, but if the server goes down after the transaction commits and before the background thread has flushed the redo logs to disk, the transaction's changes to pages are lost.
  • 1: Redo logs must be synchronized to disk when a transaction commits, which guarantees the durability of the transaction. 1 is also the default value of innodb_flush_log_at_trx_commit.
  • 2: Redo logs must be written to the operating system's buffer when a transaction commits, but there is no guarantee that they have actually been flushed to disk. In this case, if the database goes down but the operating system does not, the durability of the transaction is still guaranteed. If the operating system goes down as well, durability cannot be guaranteed.
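
The difference between the three values comes down to which of two operating system calls happen at commit time: write (hand the data to the OS buffer) and fsync (force it to disk). The toy class below illustrates this with Python's os.write and os.fsync. ToyRedoLog and its methods are invented for this sketch and have nothing to do with InnoDB's actual code.

```python
import os
import tempfile


class ToyRedoLog:
    def __init__(self, path, flush_log_at_trx_commit=1):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.mode = flush_log_at_trx_commit
        self.log_buffer = bytearray()

    def append(self, redo_bytes):
        self.log_buffer += redo_bytes

    def _write_to_os(self):
        if self.log_buffer:
            os.write(self.fd, bytes(self.log_buffer))
            self.log_buffer.clear()

    def commit(self):
        if self.mode == 0:
            return                 # 0: leave everything to the background thread
        self._write_to_os()        # 1 and 2: hand the log to the OS buffer
        if self.mode == 1:
            os.fsync(self.fd)      # 1: also force it to disk

    def background_flush(self):
        self._write_to_os()
        os.fsync(self.fd)

    def close(self):
        os.close(self.fd)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_redo.log")
    log = ToyRedoLog(path, flush_log_at_trx_commit=0)
    log.append(b"redo-for-trx-1")
    log.commit()
    size_after_commit = os.path.getsize(path)
    log.background_flush()
    size_after_flush = os.path.getsize(path)
    log.close()

print(size_after_commit, size_after_flush)   # 0 14
```

With the value 0, nothing has left the process when commit returns, so a crash at that moment loses the transaction. With 2 the data would already be in the OS buffer, and with 1 it would be on disk.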

8. Crash recovery

While the server is running normally, the redo log is pure overhead: it is never used, and it makes performance worse. But if the database does go down, the redo log becomes a treasure, because on restart we can use its records to restore pages to the state they were in before the crash. Let's take a closer look at the recovery process.

8.1 Determining the starting point for recovery

As we said before, redo logs before checkpoint_lsn can be overwritten. That is, the dirty pages corresponding to them have already been flushed to disk, so there is no need to restore them. For redo logs after checkpoint_lsn, the corresponding dirty pages may or may not have been flushed. We cannot be sure, so we need to read redo logs starting from checkpoint_lsn to restore pages. The management information of the first file in the redo log file group contains two blocks that store checkpoint_lsn, and we of course want the one from the most recent checkpoint. What tells us how recent a checkpoint is is checkpoint_no: we just read checkpoint_no from the checkpoint1 and checkpoint2 blocks and compare them. The block with the larger checkpoint_no holds the most recent checkpoint information, which gives us the most recent checkpoint_lsn and its offset in the redo log file group, checkpoint_offset.
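
Choosing between the two checkpoint blocks can be sketched with Python's struct module. Several things here are assumptions, not facts from the text: the four 8-byte fields are taken to sit back to back at the start of the block in big-endian byte order, checkpoint1 and checkpoint2 are taken to be the second and fourth 512-byte block of the file, and checksum verification is left out. The function names are hypothetical.

```python
import struct

BLOCK_SIZE = 512
CHECKPOINT_1_OFFSET = 1 * BLOCK_SIZE   # assumed: second block of the file
CHECKPOINT_2_OFFSET = 3 * BLOCK_SIZE   # assumed: fourth block of the file
CHECKPOINT_FORMAT = ">QQQQ"            # assumed layout: four 8-byte integers


def parse_checkpoint(header: bytes, offset: int) -> dict:
    no, lsn, file_offset, buf_size = struct.unpack_from(
        CHECKPOINT_FORMAT, header, offset)
    return {
        "checkpoint_no": no,
        "checkpoint_lsn": lsn,
        "checkpoint_offset": file_offset,
        "log_buf_size": buf_size,
    }


def latest_checkpoint(header: bytes) -> dict:
    """Pick the checkpoint block with the larger checkpoint_no."""
    cp1 = parse_checkpoint(header, CHECKPOINT_1_OFFSET)
    cp2 = parse_checkpoint(header, CHECKPOINT_2_OFFSET)
    return max((cp1, cp2), key=lambda cp: cp["checkpoint_no"])


# Build a fake 2048-byte management area for demonstration purposes.
header = bytearray(2048)
struct.pack_into(CHECKPOINT_FORMAT, header, CHECKPOINT_1_OFFSET,
                 4, 9948, 3292, 16777216)
struct.pack_into(CHECKPOINT_FORMAT, header, CHECKPOINT_2_OFFSET,
                 3, 8916, 2260, 16777216)

best = latest_checkpoint(bytes(header))
print(best["checkpoint_no"], best["checkpoint_lsn"], best["checkpoint_offset"])
# 4 9948 3292
```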

8.2 Determining the endpoint of recovery

The starting point of redo log recovery is now known, but where is the end point? This goes back to the structure of the block. Redo logs are written sequentially: when one block is full, writing continues in the next block:

insert image description here
The log block header common to every block has an attribute called LOG_BLOCK_HDR_DATA_LEN that records how many bytes of the block are in use. For a full block this value is always 512. If the value is not 512, the block is not full, and it is the last block that needs to be scanned in this crash recovery.
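
Finding the end point then amounts to walking the blocks and stopping at the first one whose LOG_BLOCK_HDR_DATA_LEN is not 512. The sketch below assumes the field is a 2-byte big-endian integer at offset 4 of the block header (right after the 4-byte LOG_BLOCK_HDR_NO). The helper names are hypothetical, and wrap-around in the file group is ignored.

```python
import struct

BLOCK_SIZE = 512
DATA_LEN_OFFSET = 4   # assumed position of LOG_BLOCK_HDR_DATA_LEN (2 bytes)


def last_block_index(log_bytes: bytes):
    """Index of the first block that is not full, or None if all are full."""
    for index in range(len(log_bytes) // BLOCK_SIZE):
        (data_len,) = struct.unpack_from(
            ">H", log_bytes, index * BLOCK_SIZE + DATA_LEN_OFFSET)
        if data_len != BLOCK_SIZE:
            return index
    return None


def make_block(data_len: int) -> bytes:
    """Build an otherwise empty block with the given data length (demo only)."""
    block = bytearray(BLOCK_SIZE)
    struct.pack_into(">H", block, DATA_LEN_OFFSET, data_len)
    return bytes(block)


# Two full blocks followed by one that is only filled up to byte 220.
log_bytes = make_block(512) + make_block(512) + make_block(220)
print(last_block_index(log_bytes))   # 2
```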

8.3 How to recover

Once we know which redo logs need to be scanned for crash recovery, the next question is how to recover. Suppose the current redo log file contains 5 redo logs, as shown in the figure:

insert image description here
Since redo0 lies before checkpoint_lsn, it can be ignored during recovery. We could now scan the redo logs after checkpoint_lsn in order and restore the corresponding pages according to their contents. That works, but InnoDB still uses a few techniques to speed up recovery:

  • Use a hash table
    A hash value is computed from the space ID and page number attributes of each redo log, and redo logs with the same space ID and page number are put into the same slot of the hash table. If several redo logs share the same space ID and page number, they are linked in a list in the order in which they were generated, as shown in the figure:

    insert image description here

  • After that, the hash table is traversed. Because all redo logs that modify the same page sit in one slot, a page can be repaired in one go (avoiding a lot of random I/O for reading pages), which speeds up recovery. Note also that the redo logs of one page are kept in the order in which they were generated and are applied in that order during recovery. Applying them in a different order can cause errors. For example, if the original operations were to insert a record and then delete it, applying them out of order could mean deleting the record first and then inserting it, which is obviously wrong.

  • Skip pages that have already been flushed to disk
    As we said before, the dirty pages corresponding to redo logs before checkpoint_lsn have definitely been flushed to disk, but we cannot be sure about the redo logs after checkpoint_lsn. The main reason is that after the most recent checkpoint, background threads may have continued to flush dirty pages out of the Buffer Pool from the LRU list and the flush list. For redo logs after checkpoint_lsn whose dirty pages had already been flushed to disk when the crash occurred, there is no need to modify the page again according to the redo log during recovery.

How do we know during recovery whether the dirty page for a given redo log had already been flushed to disk when the crash occurred? This goes back to the structure of the page. As we said earlier, every page has a part called the File Header, which contains an attribute called FIL_PAGE_LSN that records the lsn value at the time the page was last modified (it is in fact the newest_modification value from the page's control block). If a dirty page was flushed to disk after the most recent checkpoint, its FIL_PAGE_LSN must be greater than checkpoint_lsn. For any such page, redo logs whose lsn value is less than FIL_PAGE_LSN do not need to be applied again, which further speeds up crash recovery.
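
Both optimizations can be sketched together: group the redo logs by (space ID, page number), then apply each group in order while skipping records the page already contains. Everything below is illustrative. The record fields, the function names, and the choice to compare the lsn at the start of the mtr that generated a record (start_lsn) against FIL_PAGE_LSN are assumptions of this sketch.

```python
from collections import defaultdict


def group_by_page(redo_records):
    """Hash table keyed by (space_id, page_no); each slot keeps log order."""
    table = defaultdict(list)
    for record in redo_records:   # records arrive in generation order
        table[(record["space_id"], record["page_no"])].append(record)
    return table


def recover(redo_records, page_lsns):
    """Return the operations that would be re-applied, page by page.

    `page_lsns` maps (space_id, page_no) to the FIL_PAGE_LSN found on disk.
    """
    applied = []
    for page_key, records in group_by_page(redo_records).items():
        page_lsn = page_lsns.get(page_key, 0)
        for record in records:
            if record["start_lsn"] < page_lsn:
                continue          # the page on disk already has this change
            applied.append((page_key, record["op"]))
    return applied


redo_records = [
    {"space_id": 0, "page_no": 5, "start_lsn": 8916, "op": "insert row"},
    {"space_id": 0, "page_no": 7, "start_lsn": 8916, "op": "update row"},
    {"space_id": 0, "page_no": 5, "start_lsn": 9948, "op": "delete row"},
]
# Page 7 was flushed after the checkpoint, so its FIL_PAGE_LSN is 9948.
page_lsns = {(0, 5): 8716, (0, 7): 9948}

for page_key, op in recover(redo_records, page_lsns):
    print(page_key, op)
# (0, 5) insert row
# (0, 5) delete row
```

Both changes to page 5 are applied back to back and in their original order, while the change to page 7 is skipped because the page on disk is already newer.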


Origin blog.csdn.net/liang921119/article/details/130883489