Foreword
This article and the next few rely heavily on background knowledge such as the InnoDB record row format, the page format, how indexes work, and how tablespaces are composed. If you have not mastered these topics, the following text may be hard going, so please make sure you are comfortable with what we covered earlier.
1. What is the redo log?
We know that the InnoDB storage engine manages storage space in units of pages, and the insert, delete, update, and select operations we perform are really page accesses (reading pages, writing pages, creating new pages). As we said when discussing the Buffer Pool, a page on disk must first be cached in the in-memory Buffer Pool before it can be accessed. But when discussing transactions we emphasized a property called durability: for a committed transaction, even if the system crashes after the commit, the changes it made to the database must not be lost. If we only modify pages in the in-memory Buffer Pool, and a failure right after the commit wipes out the data in memory, the changes made by the committed transaction are lost too, which is unacceptable. So how do we guarantee durability? A very simple approach is to flush all pages modified by the transaction to disk before the transaction commits, but this crude approach has some problems:
- Flushing a complete data page is too wasteful.
Sometimes we modify only one byte in a page, but InnoDB performs disk I/O in units of pages, so a whole page has to be flushed from memory to disk. A page is 16KB by default, and flushing 16KB to disk after changing a single byte is obviously wasteful.
- Random I/O is slow.
A transaction may contain many statements, and even one statement may modify many pages. The pages modified by the transaction may not be adjacent, so flushing the modified Buffer Pool pages to disk requires a lot of random I/O, and random I/O is much slower than sequential I/O, especially on traditional mechanical hard disks.
What should we do? Go back to our original goal: we only want the changes that a committed transaction made to the database to be permanent, so that even if the system crashes later, they can be recovered after a restart. So we do not actually need to flush every page the transaction modified in memory each time a transaction commits. We only need to record what was modified. For example, if a transaction changes the byte at offset 1000 of page 100 in the system tablespace from 1 to 2, we only need to record:
Update the value at offset 1000 of page 100 in tablespace 0 to 2.
When the transaction commits, we flush this record to disk. Even if the system crashes later, after restarting we just re-apply the recorded steps to the data page, and the changes made by the transaction are recovered, which satisfies the durability requirement. Because the data page is updated again ("redone") according to these records when the system restarts after a crash, the records are called the redo log. Compared with flushing all modified in-memory pages to disk at commit, flushing only the redo logs generated during the transaction has the following benefits:
- Redo logs take up very little space.
The storage needed for the tablespace ID, page number, offset, and new value is small. We will cover the redo log format in detail later. For now it is enough to know that a redo log record does not take up much space.
- Redo logs are written to disk sequentially.
While a transaction executes, each statement may generate several redo logs. They are written to disk in the order they are generated, that is, with sequential I/O.
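The record-then-replay idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `RedoRecord` class, the `apply_redo` function, and the dictionary standing in for the disk are hypothetical names invented here, not InnoDB structures.

```python
from dataclasses import dataclass

PAGE_SIZE = 16 * 1024  # InnoDB's default page size, as stated above


@dataclass(frozen=True)
class RedoRecord:
    """Hypothetical in-memory form of one simple redo record."""
    space_id: int
    page_no: int
    offset: int
    value: bytes


def apply_redo(pages: dict, rec: RedoRecord) -> None:
    """Re-apply one record to the page it names (the "redo" step)."""
    page = pages.setdefault((rec.space_id, rec.page_no), bytearray(PAGE_SIZE))
    page[rec.offset:rec.offset + len(rec.value)] = rec.value


# On-disk state at crash time: the byte at offset 1000 of page 100 is still 1,
# because the modified page itself was never flushed.
disk_pages = {(0, 100): bytearray(PAGE_SIZE)}
disk_pages[(0, 100)][1000] = 1

# The transaction committed, so its (tiny) redo record did reach disk.
redo_log = [RedoRecord(space_id=0, page_no=100, offset=1000, value=b"\x02")]

# Recovery after restart: replay the log in the order it was written.
for rec in redo_log:
    apply_redo(disk_pages, rec)

print(disk_pages[(0, 100)][1000])  # 2
```

The log entry carries a single payload byte plus a few integers, while the page it protects is 16KB, which is the space saving described above.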
2. Redo log format
From the above, we know that the redo log essentially just records what modifications a transaction made to the database. InnoDB defines multiple redo log types for different modification scenarios, but most types share the same general structure, made up of four parts: type, space ID, page number, and data.
The parts are as follows:
- type: the type of the redo log. InnoDB defines 53 different redo log types in total, which are introduced in detail later.
- space ID: the tablespace ID.
- page number: the page number.
- data: the specific content of the redo log.
2.1 Simple redo log type
As we mentioned when discussing the InnoDB record row format, if we do not explicitly define a primary key for a table and the table has no unique key either, InnoDB automatically adds a hidden column called row_id to the table as the primary key. The value of this hidden row_id column is assigned as follows:
- The server maintains a global variable in memory. Whenever a record is inserted into a table containing a hidden row_id column, the value of the variable is used as the row_id value of the new record, and the variable is incremented by 1.
- Whenever the value of this variable is a multiple of 256, it is flushed to an attribute called Max Row ID in page 7 of the system tablespace (we covered it in detail when introducing the tablespace structure).
- When the system starts, it loads the Max Row ID attribute into memory, adds 256 to it, and assigns the result to the global variable mentioned above (because at the last shutdown the value of the global variable may have been greater than the Max Row ID attribute value).
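The three rules above can be modeled with a toy allocator. This is a sketch, not InnoDB code: the `RowIdAllocator` class and its field names are made up for illustration, and `persisted_max_row_id` merely stands in for the Max Row ID attribute on disk.

```python
class RowIdAllocator:
    """Toy model of the hidden row_id counter described above."""

    FLUSH_INTERVAL = 256

    def __init__(self, persisted_max_row_id: int = 0):
        # Stand-in for the Max Row ID attribute in page 7 of the system tablespace.
        self.persisted_max_row_id = persisted_max_row_id
        # Startup rule: skip ahead by 256 so that ids handed out after the
        # last flush, but before the shutdown or crash, are never reused.
        self.next_row_id = persisted_max_row_id + self.FLUSH_INTERVAL

    def allocate(self) -> int:
        row_id = self.next_row_id
        if row_id % self.FLUSH_INTERVAL == 0:
            # This is the page modification that must be covered by a redo log.
            self.persisted_max_row_id = row_id
        self.next_row_id += 1
        return row_id


allocator = RowIdAllocator(persisted_max_row_id=0)
ids = [allocator.allocate() for _ in range(300)]

# Simulate a crash: only the persisted attribute survives.
restarted = RowIdAllocator(allocator.persisted_max_row_id)
print(ids[-1], allocator.persisted_max_row_id, restarted.next_row_id)  # 555 512 768
```

After the simulated restart the counter resumes at 768, above every id handed out before the crash, which is the reason for adding 256 at startup.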
The Max Row ID attribute occupies 8 bytes. When a transaction inserts a record into a table containing a hidden row_id column and the row_id assigned to the record is a multiple of 256, an 8-byte value is written at the corresponding offset of page 7 in the system tablespace. But this write actually happens in the Buffer Pool, so we need to record a redo log for the page modification, so that after a crash the change made to the page by a committed transaction can be recovered. In this case the page modification is extremely simple: the redo log only needs to record that a few bytes were changed at a certain offset of a certain page, and what the new content is. InnoDB calls this extremely simple kind of redo log a physical log, and defines several redo log types according to how much data is written to the page:
- MLOG_1BYTE (type field value 1 in decimal): writes 1 byte at a certain offset of the page.
- MLOG_2BYTE (type field value 2 in decimal): writes 2 bytes at a certain offset of the page.
- MLOG_4BYTE (type field value 4 in decimal): writes 4 bytes at a certain offset of the page.
- MLOG_8BYTE (type field value 8 in decimal): writes 8 bytes at a certain offset of the page.
- MLOG_WRITE_STRING (type field value 30 in decimal): writes a string of data at a certain offset of the page.
The Max Row ID attribute mentioned above occupies 8 bytes, so modifying it on the page records a redo log of type MLOG_8BYTE, whose structure is as follows:
[Figure: structure of an MLOG_8BYTE redo log]
The MLOG_1BYTE, MLOG_2BYTE, and MLOG_4BYTE redo log structures are similar to that of MLOG_8BYTE, except that the data part holds the corresponding number of bytes. An MLOG_WRITE_STRING redo log writes a string of data, but because the number of bytes written is not fixed in advance, a len field has to be added to the log structure:
[Figure: structure of an MLOG_WRITE_STRING redo log, including the len field]
Tip:
If the len field of an MLOG_WRITE_STRING redo log is set to 1, 2, 4, or 8, it can stand in for an MLOG_1BYTE, MLOG_2BYTE, MLOG_4BYTE, or MLOG_8BYTE redo log respectively. So why design so many types? Simply to save space: when the len field can be left out, it is left out, and every byte saved counts.
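To make the saving concrete, here is a rough sketch that encodes the same 8-byte write both ways. The type values 8 and 30 come from the text above. Everything else is an assumption made for illustration: the fixed field widths (1-byte type, 4-byte space ID, 4-byte page number, 2-byte offset, 2-byte len), the function names, and the offset 38 are hypothetical, and real InnoDB stores some of these fields in a compressed form.

```python
import struct

MLOG_8BYTE = 8
MLOG_WRITE_STRING = 30

# Assumed fixed-width prefix: type (1 byte), space ID (4), page number (4),
# offset within the page (2). Not the real on-disk encoding.
HEADER = struct.Struct(">BIIH")


def encode_8byte(space_id: int, page_no: int, offset: int, value: int) -> bytes:
    """The type already implies 8 data bytes, so no len field is needed."""
    return HEADER.pack(MLOG_8BYTE, space_id, page_no, offset) + struct.pack(">Q", value)


def encode_write_string(space_id: int, page_no: int, offset: int, data: bytes) -> bytes:
    """The extra len field tells the reader how many data bytes follow."""
    return (HEADER.pack(MLOG_WRITE_STRING, space_id, page_no, offset)
            + struct.pack(">H", len(data)) + data)


# Offset 38 is an arbitrary value chosen for the example.
fixed = encode_8byte(0, 7, 38, 512)
generic = encode_write_string(0, 7, 38, struct.pack(">Q", 512))
print(len(fixed), len(generic))  # 19 21
```

Both records describe the same page change, but the generic form pays for its len field on every write, which is the cost the dedicated types avoid.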
2.2 More complex redo log types
Sometimes executing a statement modifies many pages, including system data pages and user data pages (user data refers to the B+ trees of the clustered index and the secondary indexes). Take an INSERT statement as an example. Besides inserting data into B+ tree pages, it may also update system data such as Max Row ID. As users, though, we usually care more about the updates the statement makes to the B+ trees:
- A table has as many B+ trees as it has indexes, and a single INSERT statement may update all of them.
- For a given B+ tree, the statement may update leaf node pages, update internal node pages, or create new pages (when the leaf node where the record belongs does not have enough free space to store it, the page is split and a directory entry record is added to the internal node page).
During the execution of the statement, every page modification made by the INSERT must be saved to the redo log. That is easy to say but troublesome to do. For example, when inserting a record into the clustered index, if the located leaf node has enough free space for the record, only that leaf page needs to be updated. So can we just record one MLOG_WRITE_STRING redo log saying what data was added at a certain offset of the page? That's too naive~ Don't forget that besides the actual records, a data page also contains the File Header, Page Header, Page Directory, and other parts (see the chapter on data pages for details), so whenever a record is inserted into the data page of a leaf node, many other places are updated too, such as:
- Slot information in the Page Directory may be updated.
- Various page statistics in the Page Header may change: the number of slots in PAGE_N_DIR_SLOTS, the lowest address of the unused space in PAGE_HEAP_TOP, the number of records in the page in PAGE_N_HEAP, and so on.
- The records in a data page form a singly linked list ordered by index column from small to large. Every time a record is inserted, the next_record attribute in the record header of the previous record must be updated to maintain this list.
- There are other places to update as well, which we won't list here.
A simple schematic looks like this:
[Figure: the places in a page that are modified when a record is inserted]
The point of all this is that inserting a record into a page changes a great many places. If we recorded these modifications with the simple physical redo logs introduced above, there would be two options:
- Solution 1: Record one redo log for each modified location. That is, write as many physical redo logs as there are bold blocks in the figure above. The drawback is obvious: with so many modified places, the redo logs may take up more space than the whole page~
- Solution 2: Treat all the data between the first modified byte and the last modified byte of the page as the data of a single physical redo log. As the figure shows, there is still a lot of unmodified data between those two bytes, and writing all of it to the redo log would be a waste~
Because both of these physical ways of recording page changes are wasteful, InnoDB's designers, in the spirit of thrift, introduced some new redo log types, such as:
- MLOG_REC_INSERT (type field value 9 in decimal): the redo log type for inserting a record that uses a non-compact row format.
- MLOG_COMP_REC_INSERT (type field value 38 in decimal): the redo log type for inserting a record that uses a compact row format.
Tip:
Redundant is a relatively primitive row format and is not compact. The Compact, Dynamic, and Compressed row formats are newer and are compact (they take up less storage space).
- MLOG_COMP_PAGE_CREATE (type field value 58 in decimal): the redo log type for creating a page that stores records in a compact row format.
- MLOG_COMP_REC_DELETE (type field value 42 in decimal): the redo log type for deleting a record that uses a compact row format.
- MLOG_COMP_LIST_START_DELETE (type field value 44 in decimal): the redo log type for deleting a series of compact-row-format records starting from a given record in the page.
- MLOG_COMP_LIST_END_DELETE (type field value 43 in decimal): the counterpart of MLOG_COMP_LIST_START_DELETE, indicating that a series of records is deleted up to the record named in the MLOG_COMP_LIST_END_DELETE redo log.
Tip:
When we discussed the InnoDB data page format, we emphasized that the records in a data page form a singly linked list ordered by index column. Sometimes we need to delete all records whose index column values fall within a certain range. Writing one redo log per deleted record would be inefficient, so the MLOG_COMP_LIST_START_DELETE and MLOG_COMP_LIST_END_DELETE types were introduced, which greatly reduce the number of redo logs.
- MLOG_ZIP_PAGE_COMPRESS (type field value 51 in decimal): the redo log type for compressing a data page.
- There are many more types. We won't list them here and will introduce them when they are needed~
These redo log types have both a physical and a logical meaning:
- Physically, these logs indicate which page of which tablespace was modified.
- Logically, when the system restarts after a crash, the page cannot be restored by writing given bytes at a given offset straight from the log. Instead, some predefined functions must be called, and only after they have run is the page restored to the way it was before the crash.
This may be a bit confusing. Let's take MLOG_COMP_REC_INSERT, the redo log type for inserting a record in a compact row format, as an example of what the physical and logical levels mean. Without further ado, here is the structure of an MLOG_COMP_REC_INSERT redo log (there are many fields, so they are shown vertically):
[Figure: structure of an MLOG_COMP_REC_INSERT redo log]
Several places in the MLOG_COMP_REC_INSERT redo log structure deserve attention:
- As we said when studying indexes, in a data page, whether it is a leaf node or a non-leaf node, records are sorted in ascending order of the index columns. For secondary indexes, when the index column values are equal, records are further sorted by primary key value. The n_uniques value in the figure is the number of fields needed to make a record unique, so that when a record is inserted it can be sorted by its first n_uniques fields. For a clustered index, n_uniques is the number of primary key columns. For a secondary index, it is the number of index columns plus the number of primary key columns. Note that a unique secondary index may contain NULL values, so its n_uniques is still the number of index columns plus the number of primary key columns.
- field1_len ~ fieldn_len are the storage sizes of the record's fields. Whether a field is fixed-length (for example INT) or variable-length (for example VARCHAR(M)), its size is always written to the redo log.
- offset is the address within the page of the record that precedes this one. Why record the address of the previous record? Because every insert into a data page must update the record linked list maintained in the page. The record header of each record contains an attribute called next_record, so inserting a new record requires updating the next_record attribute of the previous record.
- A record consists of two parts, extra information and real data, and the total size of these two parts is the total storage size of the record. That total size can be computed indirectly from the end_seg_len value. Why not store the total size directly? Because writing redo logs is a very frequent operation, InnoDB's designers try hard to shrink the redo log itself and came up with some roundabout algorithms to do so. end_seg_len is a field introduced to save redo log space.
- mismatch_index is also there to reduce the size of the redo log. You can ignore it.
Clearly, an MLOG_COMP_REC_INSERT redo log does not record what PAGE_N_DIR_SLOTS, PAGE_HEAP_TOP, PAGE_N_HEAP, and so on were changed to. It only records the elements needed to insert a record into the page. When the system restarts after a crash, the server calls the function that inserts a record into a page, and the data in the redo log serves as the arguments for that call. After the function has run, the values of PAGE_N_DIR_SLOTS, PAGE_HEAP_TOP, PAGE_N_HEAP, and so on in the page are restored to what they were before the crash. This is what is meant by a logical log.
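The "log the arguments, re-run the function" idea can be shown with a toy page. Nothing below is InnoDB's real page layout: `ToyPage`, `page_insert`, the `TOY_REC_INSERT` tag, and the starting value 120 are all invented for this sketch, and the two counters only loosely echo PAGE_N_HEAP and PAGE_HEAP_TOP.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class ToyPage:
    """Drastically simplified data page."""
    records: list = field(default_factory=list)  # kept sorted by key
    n_heap: int = 0       # loosely echoes PAGE_N_HEAP
    heap_top: int = 120   # loosely echoes PAGE_HEAP_TOP; 120 is arbitrary


def page_insert(page: ToyPage, key: int, payload: bytes) -> None:
    """The one insert routine, used both at run time and during recovery."""
    bisect.insort(page.records, (key, payload))
    page.n_heap += 1
    page.heap_top += len(payload)


# Normal execution: modify the in-memory page and log only the arguments.
buffer_pool_page = ToyPage()
redo_log = []
for key, payload in [(20, b"bbbb"), (5, b"aa"), (10, b"cccccc")]:
    page_insert(buffer_pool_page, key, payload)
    redo_log.append(("TOY_REC_INSERT", key, payload))

# Crash: the in-memory page is lost. Recovery replays the log by calling the
# same routine, which recomputes the header fields as a side effect.
recovered = ToyPage()
for _type, key, payload in redo_log:
    page_insert(recovered, key, payload)

print(recovered == buffer_pool_page)  # True
```

The log never mentions `n_heap` or `heap_top`, yet both come back with the right values, because the replayed function maintains them itself.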
2.3 Summary of redo log format
Although we have covered a lot about the redo log format, unless you are writing a tool that parses redo logs or building a redo log system of your own, there is no need to study every InnoDB redo log format in depth. I only introduced a few types as illustrations so that everyone understands this: the redo log records all the modifications a transaction makes to the database during execution, so that after a crash and restart every change made by the transaction can be recovered.
Tip:
To save the storage space taken by redo logs, InnoDB's designers may also compress some of the data in them. For example, the space ID and page number normally take 4 bytes each, but after compression they may take less. We won't go into the specific compression algorithm.
3. Mini-Transaction
3.1 Write redo logs in the form of groups
A statement may modify several pages during execution. For example, the INSERT statement mentioned earlier may modify the Max Row ID attribute in page 7 of the system tablespace (it may also update other system pages, which we have not listed), and it will also update pages in the B+ trees of the clustered index and the secondary indexes. Since these page changes all happen in the Buffer Pool, the corresponding redo logs must be recorded after the pages are modified. InnoDB's designers divide the redo logs generated during the execution of a statement into several indivisible groups, such as:
- The redo logs generated when updating the Max Row ID attribute are indivisible.
- The redo logs generated when inserting a record into a page of the clustered index's B+ tree are indivisible.
- The redo logs generated when inserting a record into a page of a secondary index's B+ tree are indivisible.
- The redo logs generated by other page access operations are likewise indivisible.
What does indivisible mean? Take inserting a record into the B+ tree of an index as an example. Before inserting the record, we need to locate the data page of the leaf node where it belongs. Once that data page is located, there are two possible situations:
- Situation 1: The data page has enough free space for the record to be inserted. This is simple: insert the record directly into the data page and record one redo log of type MLOG_COMP_REC_INSERT. We call this an optimistic insert. Suppose the B+ tree of an index looks like this:
[Figure: B+ tree before the insert]
Now we want to insert a record with key value 10, which clearly belongs in page b. Since page b has enough free space for one more record, we insert the record directly into page b, like this:
[Figure: B+ tree after the optimistic insert into page b]
- Situation 2: The data page does not have enough free space. Then things get ugly. As we said before, in this case we need to perform a page split: create a new leaf node, copy some of the records in the original data page to the new data page, insert the record, link the new leaf node into the leaf node linked list, and finally add a directory entry record pointing to the new page in the internal node. This process clearly modifies multiple pages, so multiple redo logs are generated. We call this a pessimistic insert. Suppose the B+ tree of an index looks like this:
[Figure: B+ tree with page b full]
Now we want to insert a record with key value 10, which clearly belongs in page b, but as the figure shows, page b is already full and has no free space for the new record, so we need to perform a page split, like this:
[Figure: B+ tree after the page split]
If page a, as an internal node, does not have enough free space for the additional directory entry record, page a must be split as well, which means more pages are modified and more redo logs are generated. In addition, because a pessimistic insert needs to allocate new data pages, some system pages must be modified too, for example the statistics of various segments and extents and of various linked lists (such as the FREE list and the FSP_FREE_FRAG list introduced in the tablespace chapter). All told, 20 or 30 redo logs need to be recorded.
Tip:
Pessimistic inserts are not the only case that generates many redo logs. Because of some other InnoDB features, an optimistic insert may also generate multiple redo logs (we won't go into the specific features, for reasons of space~).
InnoDB's designers decided that inserting a record into the B+ tree of an index must be atomic. It cannot stop halfway. For example, in a pessimistic insert, if the new page has been allocated, the data has been copied, and the new record has been inserted into the page, but the directory entry record has not yet been added to the internal node, then the insertion is incomplete and the B+ tree is left in an incorrect form. The redo log exists to restore the pre-crash state when the system restarts after a crash. If only part of the redo logs of a pessimistic insert were recorded, the B+ tree of the index would be restored to an incorrect state on restart, which InnoDB's designers cannot tolerate. So they stipulated that for operations that must be atomic, the redo logs are recorded in groups. During crash recovery, for the redo logs in a group, either all of them are applied or none of them is. How is this done? There are two cases:
- Some operations that must be atomic generate multiple redo logs. For example, a pessimistic insert into the B+ tree of an index generates many redo logs. How are these redo logs grouped together? InnoDB's designers use a very simple trick: they append a special redo log after the last redo log of the group. Its type is MLOG_MULTI_REC_END (type field value 31 in decimal), and its structure is very simple, with just a type field:
[Figure: structure of an MLOG_MULTI_REC_END redo log, a single type field]
So the series of redo logs generated by an operation that must be atomic has to end with an MLOG_MULTI_REC_END redo log, like this:
[Figure: a group of redo logs terminated by an MLOG_MULTI_REC_END redo log]
During crash recovery, a complete group of redo logs is considered parsed only when an MLOG_MULTI_REC_END redo log is reached, and only then is the group applied. Otherwise the redo logs parsed so far are discarded.
- Some operations that must be atomic generate only one redo log. For example, updating the Max Row ID attribute generates a single redo log. An MLOG_MULTI_REC_END redo log could be appended after it as well, but InnoDB's designers are thrifty and don't want to waste a single bit. Although there are dozens of redo log types, the count is below 127, so 7 bits are enough to cover all of them, while the type field occupies 1 byte. The spare bit can therefore be used to indicate that the atomic operation generated only a single redo log, as shown in the schematic:
[Figure: the type field, with its first bit used as the single-record flag]
If the first (most significant) bit of the type field is 1, the operation that must be atomic generated a single redo log. Otherwise it generated a series of redo logs.
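The grouping rule can be sketched as a small parser. The value 31 for MLOG_MULTI_REC_END and the use of the top bit of the type byte come from the text. The `complete_groups` function, the `(type_byte, payload)` tuple representation, and the sample payload strings are assumptions made for this illustration, not InnoDB's recovery code.

```python
MLOG_MULTI_REC_END = 31
SINGLE_REC_FLAG = 0x80  # most significant bit of the 1-byte type field
TYPE_MASK = 0x7F        # the remaining 7 bits hold the actual type


def complete_groups(records):
    """Yield only the groups that are safe to apply during recovery.

    `records` is a list of (type_byte, payload) pairs in log order.
    """
    pending = []
    for type_byte, payload in records:
        if type_byte & SINGLE_REC_FLAG:
            # A one-record group is complete by itself.
            yield [(type_byte & TYPE_MASK, payload)]
        elif type_byte == MLOG_MULTI_REC_END:
            yield pending
            pending = []
        else:
            pending.append((type_byte, payload))
    # Whatever is left in `pending` never saw its end marker and is dropped.


log = [
    (8 | SINGLE_REC_FLAG, "update Max Row ID"),
    (38, "insert record"), (2, "update page header"), (MLOG_MULTI_REC_END, None),
    (38, "insert record"), (4, "split in progress"),  # crash before the end marker
]
groups = list(complete_groups(log))
print(len(groups))  # 2
```

The last two records describe a half-finished operation, so they are never returned and would not be applied.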
3.2 The concept of Mini-Transaction
MySQL calls one atomic access to the underlying pages a Mini-Transaction, or mtr for short. For example, modifying the Max Row ID value once, as described above, is one Mini-Transaction, and inserting a record into the B+ tree of an index is also one Mini-Transaction. From the description above we also know that an mtr can contain a group of redo logs, and that during crash recovery this group is treated as an indivisible whole.
A transaction can contain several statements, each statement consists of several mtrs, and each mtr can contain several redo logs. Their relationship can be drawn like this:
[Figure: a transaction contains statements, a statement contains mtrs, an mtr contains redo logs]
4. The writing process of redo log
4.1 redo log block
To better handle crash recovery, InnoDB's designers store the redo logs generated by mtrs in pages of 512 bytes. To distinguish them from the tablespace pages discussed earlier, we call a page used to store redo logs a block (a page and a block mean almost the same thing). A redo log block looks like this:
[Figure: a redo log block, made up of the log block header, log block body, and log block trailer]
The actual redo logs are stored in the 496-byte log block body. The log block header and log block trailer in the figure hold management information. Let's take a look at what this management information is:
[Figure: the attributes of the log block header and log block trailer]
The attributes of the log block header mean the following:
- LOG_BLOCK_HDR_NO: Each block has a unique number greater than 0, and this attribute holds that number. It is assigned when the block is first used and depends on the system lsn value at that time. The LOG_BLOCK_HDR_NO value of a block is computed with the following formula:
((lsn / 512) & 0x3FFFFFFFUL) + 1
The 0x3FFFFFFFUL in this formula may look confusing, but its binary representation is friendlier:
[Figure: binary representation of 0x3FFFFFFFUL]
The figure shows that the first 2 bits of the corresponding 32-bit binary number are 0 and the last 30 bits are all 1. ANDing (&) a bit with 0 always gives 0, and ANDing (&) a bit with 1 gives the original bit. So ANDing a value with 0x3FFFFFFFUL clears its first 2 bits, which makes the result less than or equal to 0x3FFFFFFFUL. This means that however large lsn is, ((lsn / 512) & 0x3FFFFFFFUL) is between 0 and 0x3FFFFFFFUL, and after adding 1 it is between 1 and 0x40000000UL. The value 0x40000000UL should look familiar: it is 1G (2^30). In other words, the system can generate at most 1G distinct LOG_BLOCK_HDR_NO values. InnoDB stipulates that the total size of all files in the redo log file group must not exceed 512GB, and a block is 512 bytes, so the redo log file group contains at most 1G blocks, and 1G distinct numbers are enough.
In addition, the first bit of the LOG_BLOCK_HDR_NO value is special. It is called the flush bit. If it is 1, this block is the first block flushed in an operation that flushes blocks from the log buffer to disk.
- LOG_BLOCK_HDR_DATA_LEN: the number of bytes used in the block. Its initial value is 12 (because the log block body starts at the 12th byte). As more and more redo logs are written to the block, the value grows. When the log block body is completely full, the value is set to 512.
- LOG_BLOCK_FIRST_REC_GROUP: A redo log is also called a redo log record, and the redo log records generated by one mtr are called a redo log record group. LOG_BLOCK_FIRST_REC_GROUP is the offset of the first redo log record group generated by an mtr in the block (that is, the offset of the first redo log generated by the first mtr in this block).
- LOG_BLOCK_CHECKPOINT_NO: the so-called checkpoint sequence number. Checkpoints are the focus of later content, so there is no need to understand them right now.
The attribute of the log block trailer means the following:
- LOG_BLOCK_CHECKSUM: the checksum of the block, used for correctness verification. We can ignore it for now.
4.2 redo log buffer
As we said before, InnoDB's designers introduced the Buffer Pool to deal with slow disks. In the same way, redo logs cannot be written straight to disk. When the server starts, it requests from the operating system a large contiguous area of memory called the redo log buffer, or log buffer for short. This memory is divided into several consecutive redo log blocks, like this:
[Figure: the log buffer divided into consecutive redo log blocks]
We can specify the size of the log buffer with the innodb_log_buffer_size startup parameter, whose default value is 16MB.
mysql> show variables like 'innodb_log_buffer_size';
+------------------------+----------+
| Variable_name          | Value    |
+------------------------+----------+
| innodb_log_buffer_size | 16777216 |
+------------------------+----------+
1 row in set (0.01 sec)
Large log buffers enable large transactions to run without writing the log to disk before the transaction commits. Therefore, if you have transactions that update, insert, or delete many rows, increasing the log buffer can save disk I/O.
mysql> set persist innodb_log_buffer_size =33554432;
Query OK, 0 rows affected (0.04 sec)
4.3 Writing redo logs to the log buffer
Redo logs are written to the log buffer sequentially: the earlier block is written first, and when its free space is used up, writing moves on to the next block. When we want to write a redo log to the log buffer, the first question is at which offset of which block it should be written. InnoDB therefore provides a global variable called buf_free, which indicates where in the log buffer subsequent redo logs should be written, as shown in the figure:
[Figure: buf_free pointing at the next write position in the log buffer]
We said earlier that an mtr may generate several redo logs while it runs, and that these logs form an indivisible group. So a redo log is not inserted into the log buffer the moment it is generated. Instead, the logs generated while an mtr runs are first stored temporarily in one place, and when the mtr ends, all of them are copied to the log buffer together. Now suppose there are two transactions named T1 and T2, each consisting of 2 mtrs. Let's name these mtrs:
- The two mtrs of transaction T1 are called mtr_T1_1 and mtr_T1_2.
- The two mtrs of transaction T2 are called mtr_T2_1 and mtr_T2_2.
Each mtr generates a group of redo logs. The following schematic shows the logs generated by these mtrs:
Different transactions may execute concurrently, so the mtrs of T1 and T2 may be interleaved. Whenever an mtr finishes, the group of redo logs it generated is copied to the log buffer, which means that the mtrs of different transactions may be written to the log buffer alternately. Let's draw a schematic (to keep it tidy, all the redo logs generated in one mtr are drawn as a single unit):
From the schematic we can see that the groups of redo logs generated by different mtrs may occupy different amounts of space. Some mtrs generate only a few redo logs, while others generate a great many.
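The staging-then-copy behaviour described above can be sketched in a few lines of Python. This is a toy model, not InnoDB code: the class names LogBuffer and Mtr, the method names, and the use of a list index for buf_free are all assumptions made for illustration.

```python
class LogBuffer:
    """Toy log buffer: an ordered list of (mtr_name, record) entries."""

    def __init__(self):
        self.entries = []
        self.buf_free = 0  # position where the next group will be written

    def append_group(self, mtr_name, records):
        # The whole group is copied in one step, so groups from
        # different mtrs can alternate but never interleave internally.
        for record in records:
            self.entries.append((mtr_name, record))
        self.buf_free = len(self.entries)


class Mtr:
    """Toy mini-transaction that stages its redo logs privately."""

    def __init__(self, name):
        self.name = name
        self.local_logs = []

    def log(self, record):
        self.local_logs.append(record)  # not yet visible in the log buffer

    def commit(self, log_buffer):
        log_buffer.append_group(self.name, self.local_logs)
        self.local_logs = []


def demo():
    buf = LogBuffer()
    t1, t2 = Mtr("mtr_T1_1"), Mtr("mtr_T2_1")
    # The two mtrs generate logs in an interleaved order...
    t1.log("rec A")
    t2.log("rec X")
    t1.log("rec B")
    # ...but each group reaches the buffer as a unit when its mtr ends.
    t2.commit(buf)
    t1.commit(buf)
    return [name for name, _ in buf.entries]
```

Even though mtr_T1_1 logged first, its group lands after mtr_T2_1's group because mtr_T2_1 finished first.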
Five, redo log files
5.1 Timing of flushing redo logs
We said earlier that the group of redo logs generated by an mtr is copied to the log buffer when the mtr ends. Keeping these logs in memory forever is not an option, though; in certain situations they are flushed to disk, for example:
- When the log buffer runs short of space: the size of the log buffer is limited (it is specified by the system variable innodb_log_buffer_size), so if logs keep being added it fills up quickly. InnoDB considers that once the redo logs written to the log buffer occupy roughly half of its total capacity, they need to be flushed to disk.
- When a transaction commits: we said earlier that the main reasons for using redo logs are that they take up little space and are written sequentially. When a transaction commits, the modified Buffer Pool pages do not have to be flushed to disk, but to guarantee durability the redo logs corresponding to those modified pages must be flushed to disk.
- A background thread flushes continuously: a background thread (the Master Thread) flushes the redo logs in the log buffer to disk about once per second.
- When the server shuts down gracefully.
- When doing a so-called checkpoint (we have not introduced checkpoints yet; they are covered in detail later).
- Some other situations...
5.2 redo log file group
By default, the MySQL data directory (which you can find with SHOW VARIABLES LIKE 'datadir') contains two files named ib_logfile0 and ib_logfile1, and the logs in the log buffer are flushed to these two files. (This is the layout before MySQL 8.0.30; from that version on, redo log files live in the #innodb_redo subdirectory and are sized with innodb_redo_log_capacity.) If we are not satisfied with the default redo log files, we can adjust them with the following startup parameters:
- innodb_log_group_home_dir: the directory where the redo log files are located. The default value is the current data directory.
- innodb_log_file_size: the size of each redo log file. The default value is 48MB.
- innodb_log_files_in_group: the number of redo log files. The default value is 2 and the maximum is 100.
As the description above shows, there is not just one redo log file on disk but a log file group. The files are named ib_logfile[number] (where the number can be 0, 1, 2...). Writing to the log file group starts from ib_logfile0. When ib_logfile0 is full, writing continues in ib_logfile1; when ib_logfile1 is full, it continues in ib_logfile2, and so on. What happens when the last file is full? Writing wraps around to ib_logfile0, so the whole process looks like the figure below:
The total redo log file size is therefore innodb_log_file_size × innodb_log_files_in_group.
Tip:
If data is written to the redo log file group in a circular fashion, won't the head catch up with the tail, so that later redo logs overwrite earlier ones? Of course that can happen! That is why InnoDB introduced the concept of a checkpoint, which we will explain in detail later.
5.3 redo log file format
We said earlier that the log buffer is essentially a contiguous region of memory divided into blocks of 512 bytes. Flushing the redo logs in the log buffer to disk essentially means writing images of those blocks into the log files, so a redo log file is also made up of 512-byte blocks. Every file in the redo log file group has the same size and format, and consists of two parts:
- The first 2048 bytes, that is, the first 4 blocks, store management information.
- Everything from byte 2048 onward stores images of the blocks in the log buffer.
So the circular use of redo log files mentioned earlier actually starts from byte 2048 of each log file. In a schematic it looks like this:
We already covered the format of an ordinary block when discussing the log buffer: it consists of the log block header, the log block body, and the log block trailer, so we won't repeat it here. What we need to introduce is the format of the first 2048 bytes of each redo log file, that is, the first 4 special blocks. Let's look at the figure first:
As can be seen from the figure, the four blocks are:
log file header: describes some overall properties of the redo log file. Let's take a look at its structure:
The attributes are interpreted as follows:
Attribute name | Length (bytes) | Description |
---|---|---|
LOG_HEADER_FORMAT | 4 | The version of the redo log; the value is always 1 |
LOG_HEADER_PAD1 | 4 | Padding bytes with no practical meaning; ignore them |
LOG_HEADER_START_LSN | 8 | The LSN value at the start of this redo log file, that is, the LSN value corresponding to file offset 2048 (we will explain what an LSN is later; ignore it for now if it is unclear) |
LOG_HEADER_CREATOR | 32 | A string identifying the creator of this redo log file. During normal operation the value is the MySQL version number; for redo log files created with the mysqlbackup command, the value is "ibbackup" followed by the creation time. |
LOG_BLOCK_CHECKSUM | 4 | The checksum of this block; every block has one, and we can ignore it |
Tip:
InnoDB has changed the block format of the redo log many times. If the attributes above differ from those described in other books, don't panic; that is normal. We will introduce the LSN value later, so don't worry about what it is for now.
checkpoint1: records some attributes of the checkpoint. Its structure is:
The attributes are interpreted as follows:
Attribute name | Length (bytes) | Description |
---|---|---|
LOG_CHECKPOINT_NO | 8 | The server's checkpoint number; it increases by 1 every time a checkpoint is done |
LOG_CHECKPOINT_LSN | 8 | The LSN value at the end of the server's checkpoint. Crash recovery starts from this value. |
LOG_CHECKPOINT_OFFSET | 8 | The offset in the redo log file group of the LSN value in the previous attribute |
LOG_CHECKPOINT_LOG_BUF_SIZE | 8 | The size of the log buffer when the server performed the checkpoint |
LOG_BLOCK_CHECKSUM | 4 | The checksum of this block; every block has one, and we can ignore it |
Tip:
It is normal not to understand the checkpoint and LSN attributes above yet. The point here is only to become familiar with their names; we will discuss them in detail later.
The third block : unused, ignore~
checkpoint2 : The structure is the same as checkpoint1
Six, Log Sequence Number
From the moment the system starts running, pages are constantly being modified, which means redo logs are constantly being generated. The amount of redo logs only ever grows, just like a person's age: it increases from birth and can never decrease. To record how many redo logs have been written, InnoDB maintains a global variable called the Log Sequence Number, abbreviated lsn. However, unlike a person, who is 0 years old at birth, the designers of InnoDB set the initial lsn value to 8704 (that is, before a single redo log has been written, the lsn value is 8704).
We know that redo logs are not written to the log buffer one at a time, but in units of the group of redo logs generated by an mtr, and that the log content itself goes into the log block body. However, the growth of lsn is calculated from the amount of log actually written plus the log block header and log block trailer bytes that it crosses. Let's look at an example:
- When the system is initialized after the first startup, buf_free (the variable that marks where in the log buffer the next redo log should be written) points to offset 12 in the first block (the size of the log block header), and the lsn value increases by 12 accordingly:
- If the group of redo logs generated by an mtr is small, that is, if the remaining free space of the current block can hold all the logs the mtr submits, lsn increases by exactly the number of bytes of redo logs the mtr generated, like this:
We assume that mtr_1 in the figure above generates 200 bytes of redo logs, so lsn increases by 200 from 8716 to 8916.
- If the group of redo logs generated by an mtr is large, that is, if the remaining free space of the current block cannot hold all the logs the mtr submits, lsn increases by the number of bytes of redo logs the mtr generated plus the bytes of the additional log block headers and log block trailers that are crossed, like this:
We assume that mtr_2 in the figure above generates 1000 bytes of redo logs. To write them into the log buffer, two more blocks have to be allocated, so lsn increases from 8916 by 1000 + 12×2 + 4×2 = 1032, which brings it to 9948.
Tip:
Why is the initial lsn value 8704? I don't really know either; that is simply what was specified. You could just as well declare that a person is one year old at birth, as long as age keeps growing as time goes by.
As the description above shows, every group of redo logs generated by an mtr has a unique LSN value corresponding to it, and the smaller the LSN value, the earlier the redo logs were generated.
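The lsn arithmetic above can be checked with a small Python sketch. It assumes, as the example does, 512-byte blocks with a 12-byte header and a 4-byte trailer, and it assumes that block boundaries fall on lsn values that are multiples of 512 (which holds for the initial value, since 8704 = 17 × 512). The function name advance_lsn is made up for this illustration.

```python
LOG_BLOCK_SIZE = 512
LOG_BLOCK_HDR_SIZE = 12
LOG_BLOCK_TRL_SIZE = 4


def advance_lsn(lsn, nbytes):
    """Return the lsn after writing nbytes of redo log starting at lsn."""
    body_end = LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE  # offset 508 in a block
    while nbytes > 0:
        offset = lsn % LOG_BLOCK_SIZE
        if offset == 0:
            # At the very start of a block: skip its header first.
            lsn += LOG_BLOCK_HDR_SIZE
            offset = LOG_BLOCK_HDR_SIZE
        take = min(body_end - offset, nbytes)  # bytes that fit in this body
        lsn += take
        nbytes -= take
        if lsn % LOG_BLOCK_SIZE == body_end:
            # Body is full: step over this trailer and the next header.
            lsn += LOG_BLOCK_TRL_SIZE + LOG_BLOCK_HDR_SIZE
    return lsn


lsn_after_init = 8704 + LOG_BLOCK_HDR_SIZE          # 8716
lsn_after_mtr_1 = advance_lsn(lsn_after_init, 200)  # fits in the first block
lsn_after_mtr_2 = advance_lsn(lsn_after_mtr_1, 1000)  # crosses two boundaries
```

With these inputs lsn_after_mtr_1 is 8916 and lsn_after_mtr_2 is 9948, matching the 200-byte and 1000 + 12×2 + 4×2 = 1032-byte increases worked out in the text.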
6.1 flushed_to_disk_lsn
Redo logs are first written to the log buffer and later flushed to the redo log files on disk. So InnoDB maintains a global variable called buf_next_to_write, which marks which logs in the log buffer have already been flushed to disk. In a picture it looks like this:
We said earlier that lsn represents the total amount of redo logs written in the system, including logs that are in the log buffer but have not yet been flushed to disk. Correspondingly, InnoDB maintains a global variable for the amount of redo logs that have been flushed to disk, called flushed_to_disk_lsn. When the system starts for the first time, its value equals the initial lsn value, 8704. As the system runs, redo logs keep being written to the log buffer without being flushed immediately, so the gap between lsn and flushed_to_disk_lsn widens. Let's demonstrate:
After the system starts for the first time, the redo logs generated by mtr_1, mtr_2, and mtr_3 are written to the log buffer. Assume the lsn values at the beginning and end of these three mtrs are:
- mtr_1: 8716 ~ 8916
- mtr_2: 8916 ~ 9948
- mtr_3: 9948 ~ 10000
At this point lsn has grown to 10000, but because no flush has taken place, flushed_to_disk_lsn is still 8704, as shown in the figure:
Then the blocks in the log buffer are flushed to the redo log file. Assuming the logs of mtr_1 and mtr_2 are flushed to disk, flushed_to_disk_lsn grows by the amount of logs written by mtr_1 and mtr_2, so its value increases to 9948, as shown in the figure:
To sum up, when new redo logs are written to the log buffer, lsn increases first while flushed_to_disk_lsn stays unchanged; then, as logs in the log buffer are flushed to disk, flushed_to_disk_lsn increases as well. If the two values are equal, all redo logs in the log buffer have been flushed to disk.
Tip:
When an application writes a file, the data actually goes into the operating system's buffer first. If a write must not return until the operating system confirms that the data is on disk, the application has to call the fsync function provided by the operating system. In fact, flushed_to_disk_lsn only increases after the system has called fsync; when logs in the log buffer are merely written to the operating system's buffer without being explicitly flushed to disk, a separate value called write_lsn increases instead. For ease of understanding, however, we do not distinguish between flushed_to_disk_lsn and write_lsn in this discussion.
6.2 Correspondence between lsn value and redo log file offset
Because lsn represents the total amount of redo logs written by the system, lsn grows by however many bytes of logs an mtr generates (sometimes plus the sizes of the log block headers and log block trailers). So once the logs generated by an mtr have been written to disk, it is easy to calculate the offset in the redo log file group that corresponds to a given lsn value, as shown in the figure:
The initial lsn value 8704 corresponds to file offset 2048, and from there lsn grows by however many bytes of logs each mtr writes to disk.
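As a rough illustration of that correspondence, here is a simplified Python model. It assumes a freshly initialized system in which lsn 8704 sits at offset 2048 of ib_logfile0, that every file reserves its first 2048 bytes for management information, and that writing wraps around at the end of the last file. The function name and return shape are invented for this sketch; it is not InnoDB's actual routine.

```python
LOG_FILE_HDR_SIZE = 2048  # management information at the start of each file
INITIAL_LSN = 8704


def lsn_to_file_position(lsn, file_size, n_files):
    """Map an lsn to (file index, byte offset inside that file)."""
    usable_per_file = file_size - LOG_FILE_HDR_SIZE
    capacity = usable_per_file * n_files
    # Distance from the initial lsn, wrapped around the circular group.
    logical = (lsn - INITIAL_LSN) % capacity
    file_index, within = divmod(logical, usable_per_file)
    return file_index, LOG_FILE_HDR_SIZE + within


# Two 48MB files, as with the default settings described earlier.
start = lsn_to_file_position(8704, 48 * 1024 * 1024, 2)
after_mtr_1 = lsn_to_file_position(8916, 48 * 1024 * 1024, 2)
```

Here start is (0, 2048) and after_mtr_1 is (0, 2260): the 212 bytes of lsn growth translate directly into 212 bytes of file offset.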
6.3 LSN in the flush list
We know that an mtr is one atomic access to the underlying pages, that it may generate a group of indivisible redo logs along the way, and that at the end of the mtr this group of redo logs is written to the log buffer. Besides that, there is another very important thing to do at the end of an mtr: add the pages that may have been modified during its execution to the flush list of the Buffer Pool. In case you have forgotten what the flush list is, here is the picture again:
When a page cached in the Buffer Pool is modified for the first time, the control block of that page is inserted at the head of the flush list. When the page is modified again later, it is already in the flush list and is not inserted again. In other words, the dirty pages in the flush list are sorted by the time of their first modification, from latest to earliest. During this process, two attributes describing when the page was modified are recorded in the page's control block:
- oldest_modification: when a page is modified for the first time after being loaded into the Buffer Pool, the lsn value at the start of the mtr that modifies the page is written into this attribute.
- newest_modification: every time a page is modified, the lsn value at the end of the mtr that modifies the page is written into this attribute. That is, this attribute holds the system lsn value after the page's most recent modification.
Let's continue with the example used above for flushed_to_disk_lsn:
- Suppose mtr_1 modifies page a during its execution. At the end of mtr_1, the control block of page a is added to the head of the flush list. The lsn at the start of mtr_1, 8716, is written into the oldest_modification attribute of page a's control block, and the lsn at the end of mtr_1, 8916, is written into its newest_modification attribute. In a picture (to keep it tidy, oldest_modification is abbreviated as o_m and newest_modification as n_m):
- Then suppose mtr_2 modifies page b and page c during its execution. At the end of mtr_2, the control blocks of page b and page c are added to the head of the flush list. The lsn at the start of mtr_2, 8916, is written into the oldest_modification attribute of the control blocks of page b and page c, and the lsn at the end of mtr_2, 9948, is written into their newest_modification attribute. In a picture:
As the figure shows, every new node is inserted at the head of the flush list. That is, dirty pages nearer the head were modified later, and dirty pages nearer the tail were modified earlier.
- Then suppose mtr_3 modifies page b and page d during its execution. Page b has been modified before, so its control block is already in the flush list, and at the end of mtr_3 only the control block of page d needs to be added to the head of the flush list. The lsn at the start of mtr_3, 9948, is written into the oldest_modification attribute of page d's control block, and the lsn at the end of mtr_3, 10000, is written into its newest_modification attribute. In addition, because page b was modified again during mtr_3, the newest_modification value in page b's control block must be updated to 10000. In a picture:
To sum up: the dirty pages in the flush list are sorted by the time at which they were modified, that is, by the LSN value in oldest_modification. A page that is updated multiple times is not inserted into the flush list again, but its newest_modification attribute is updated.
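The bookkeeping in this example can be replayed with a short Python sketch. The FlushList class, its attribute names, and the mtr_end method are hypothetical stand-ins for the control blocks and the flush list; within mtr_2 the pages are passed as c then b purely so that page c ends up nearer the tail, as the checkpoint example later in the article assumes.

```python
from collections import deque


class FlushList:
    """Toy flush list: left end is the head, right end is the tail."""

    def __init__(self):
        self.pages = deque()
        self.oldest_modification = {}
        self.newest_modification = {}

    def mtr_end(self, modified_pages, start_lsn, end_lsn):
        for page in modified_pages:
            if page not in self.oldest_modification:
                # First modification: record o_m and insert at the head.
                self.oldest_modification[page] = start_lsn
                self.pages.appendleft(page)
            # Every modification refreshes n_m, whether or not the
            # page was already in the list.
            self.newest_modification[page] = end_lsn


def build_example():
    flush_list = FlushList()
    flush_list.mtr_end(["a"], 8716, 8916)        # mtr_1
    flush_list.mtr_end(["c", "b"], 8916, 9948)   # mtr_2
    flush_list.mtr_end(["b", "d"], 9948, 10000)  # mtr_3
    return flush_list
```

After build_example(), reading the list from head to tail gives d, b, c, a; page b keeps its oldest_modification of 8916 while its newest_modification has moved to 10000.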
6.4 checkpoint
Unfortunately, the capacity of the redo log file group is limited, so we have to reuse the files in the group circularly. But this can cause the most recently written redo logs to catch up with the earliest written ones. At this point we should recall that redo logs exist only to recover dirty pages after a system crash. If the corresponding dirty pages have already been flushed to disk, then even if the system crashes now, those redo logs are not needed to recover the pages after restart. Such redo logs no longer need to exist, and the disk space they occupy can be reused by later redo logs. In other words, whether the disk space occupied by some redo logs can be overwritten depends on whether the corresponding dirty pages have been flushed to disk. Let's return to the running example:
As the figure shows, although the redo logs generated by mtr_1 and mtr_2 have been written to disk, the dirty pages they modified are still in the Buffer Pool, so the disk space of the redo logs they generated cannot be overwritten yet. Then, as the system runs, if page a is flushed to disk, its control block is removed from the flush list, like this:
The redo logs generated by mtr_1 are now useless, and the disk space they occupy can be overwritten. InnoDB maintains a global variable called checkpoint_lsn that represents the total amount of redo logs that can be overwritten in the current system; its initial value is also 8704.
For example, now that page a has been flushed to disk, the redo logs generated by mtr_1 can be overwritten, so we can increase checkpoint_lsn. We call this process doing a checkpoint. Doing a checkpoint can be divided into two steps:
- Step 1: Calculate the largest lsn value corresponding to the redo logs that can be overwritten in the current system.
A redo log can be overwritten when its corresponding dirty page has been flushed to disk. So we only need to find the oldest_modification value of the earliest-modified dirty page in the current system: all redo logs generated while the system lsn was less than that oldest_modification value can be overwritten. We assign that dirty page's oldest_modification to checkpoint_lsn.
For example, now that page a has been flushed to disk, the tail node of the flush list, page c, is the earliest-modified dirty page in the current system. Its oldest_modification value is 8916, so we assign 8916 to checkpoint_lsn (that is, redo logs whose lsn value is less than 8916 can be overwritten).
- Step 2: Write checkpoint_lsn, the corresponding redo log file group offset, and the number of this checkpoint into the management information of the log file (that is, checkpoint1 or checkpoint2).
InnoDB maintains a variable called checkpoint_no that counts how many checkpoints the system has done so far; its value is incremented by 1 every time a checkpoint is done. We said earlier that it is easy to calculate the redo log file group offset corresponding to an lsn value, so we can calculate the offset checkpoint_offset that corresponds to checkpoint_lsn, and then write these three values into the management information of the redo log file group.
We said that each redo log file has 2048 bytes of management information, but the checkpoint information above is only written to the management information of the first log file in the group. Is it stored in checkpoint1 or in checkpoint2? InnoDB specifies that when the value of checkpoint_no is even, it is written to checkpoint1, and when it is odd, it is written to checkpoint2.
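The two steps can be condensed into a small Python sketch. The function name do_checkpoint and the dictionary it returns are made up for illustration. Two details are assumptions of the sketch rather than statements from the text: when no dirty pages remain it uses the current system lsn as checkpoint_lsn, and it picks the slot from the checkpoint_no value being recorded before incrementing it. The offset calculation is left out.

```python
def do_checkpoint(dirty_oldest_modifications, system_lsn, checkpoint_no):
    """Return (checkpoint record, next checkpoint_no).

    dirty_oldest_modifications: o_m values of the pages still in the
    flush list, in any order.
    """
    # Step 1: everything older than the earliest-modified dirty page
    # can be overwritten.
    if dirty_oldest_modifications:
        checkpoint_lsn = min(dirty_oldest_modifications)
    else:
        checkpoint_lsn = system_lsn  # assumption: nothing is dirty

    # Step 2: choose the slot by the parity of checkpoint_no.
    slot = "checkpoint1" if checkpoint_no % 2 == 0 else "checkpoint2"
    record = {
        "checkpoint_no": checkpoint_no,
        "checkpoint_lsn": checkpoint_lsn,
        "slot": slot,
    }
    return record, checkpoint_no + 1


# Page a has been flushed; pages c, b, d remain dirty.
record, next_no = do_checkpoint([8916, 8916, 9948], 10000, 0)
```

For the running example this yields a checkpoint_lsn of 8916 stored in checkpoint1, and the next checkpoint would go to checkpoint2.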
After the checkpoint information has been recorded, the relationship between the various lsn values in the redo log file group looks like this:
6.5 Batch flush dirty pages from the flush list
As we said when introducing the Buffer Pool, under normal circumstances background threads clean the LRU list and the flush list, mainly because flushing is slow and should not get in the way of user threads handling requests. However, if the system modifies pages very frequently, redo logs are written very frequently too, and the system lsn grows quickly. If the background flushing cannot write dirty pages out fast enough, the system cannot do a checkpoint in time, and user threads may have to synchronously flush the earliest-modified dirty pages (those with the smallest oldest_modification) from the flush list to disk. The redo logs corresponding to these dirty pages then become useless, and a checkpoint can be done.
6.6 View various LSN values in the system
We can use the SHOW ENGINE INNODB STATUS command to view the various LSN values in the InnoDB storage engine, for example:
mysql> SHOW ENGINE INNODB STATUS\G
(...much of the earlier status output omitted)
---
LOG
---
Log sequence number 619362521
Log buffer assigned up to 619362521
Log buffer completed up to 619362521
Log written up to 619362521
Log flushed up to 619362521
Added dirty pages up to 619362521
Pages flushed up to 619362521
Last checkpoint at 619362521
Log minimum file id is 176
Log maximum file id is 189
80457 log i/o's done, 0.00 log i/o's/second
(...much of the later status output omitted)
Where:
- Log sequence number: the lsn value of the system, that is, the amount of redo logs written so far, including the logs that are still in the log buffer.
- Log flushed up to: the value of flushed_to_disk_lsn, that is, the amount of redo logs that have been flushed to disk.
- Pages flushed up to: the oldest_modification value of the earliest-modified page in the flush list.
- Last checkpoint at: the current checkpoint_lsn value of the system.
Seven, the usage of innodb_flush_log_at_trx_commit
We said earlier that to guarantee the durability of a transaction, the user thread must flush all the redo logs generated during the transaction to disk when the transaction commits. This requirement is strict and noticeably reduces database performance. If you do not need such a strong durability guarantee, you can change the value of the system variable innodb_flush_log_at_trx_commit, which has 3 possible values:
- 0: the redo log is not synchronized to disk when the transaction commits; the task is left to a background thread. This clearly speeds up request processing, but if the server crashes after the transaction commits and before the background thread has flushed the redo log to disk, the transaction's page modifications are lost.
- 1: the redo log must be synchronized to disk when the transaction commits, which guarantees the durability of the transaction. 1 is also the default value of innodb_flush_log_at_trx_commit.
- 2: the redo log must be written to the operating system's buffer when the transaction commits, but it does not have to be actually flushed to disk. In this case, if the database crashes but the operating system does not, durability is still guaranteed; if the operating system crashes as well, durability cannot be guaranteed.
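The difference between the three values comes down to whether a commit performs a write, and whether that write is followed by an fsync. The Python sketch below mimics this with os.write and os.fsync on a temporary file. The function commit_redo and its return value are invented for illustration and say nothing about how InnoDB structures its own code.

```python
import os
import tempfile


def commit_redo(fd, redo_bytes, mode):
    """Handle redo bytes at commit; return (written_to_os, fsynced)."""
    if mode not in (0, 1, 2):
        raise ValueError("mode must be 0, 1 or 2")
    if mode == 0:
        # Nothing happens at commit; a background thread flushes later.
        return (False, False)
    os.write(fd, redo_bytes)  # hand the bytes to the operating system
    if mode == 1:
        os.fsync(fd)  # wait until the OS reports the data as on disk
        return (True, True)
    return (True, False)  # mode 2: in the OS buffer, not yet fsynced


def demo():
    results = {}
    with tempfile.TemporaryFile() as f:
        for mode in (0, 1, 2):
            results[mode] = commit_redo(f.fileno(), b"redo", mode)
    return results
```

Only mode 1 pays the cost of fsync on every commit, which is why it is both the safest and the slowest of the three.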
Eight, crash recovery
While the server is running normally, the redo log is simply a burden: not only is it of no use, it also hurts performance. But in case the database does crash, the redo log is a treasure: when restarting, we can use its records to restore pages to the state they were in before the crash. Let's take a closer look at what the recovery process looks like.
8.1 Determining the starting point for recovery
As we said before, the redo logs before checkpoint_lsn can be overwritten, which means the dirty pages corresponding to them have already been flushed to disk, so there is no need to recover them. For the redo logs after checkpoint_lsn, the corresponding dirty pages may or may not have been flushed; we cannot be sure, so we need to read redo logs starting from checkpoint_lsn to recover pages. The management information of the first file in the redo log file group contains two blocks that store checkpoint_lsn information, and of course we want the one from the most recent checkpoint. What tells us how recent a checkpoint is is its checkpoint_no: we only need to read the checkpoint_no values from the checkpoint1 and checkpoint2 blocks and compare them. Whichever value is larger marks the block that stores the most recent checkpoint information. That gives us the checkpoint_lsn of the most recent checkpoint and its offset checkpoint_offset in the redo log file group.
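Choosing between the two blocks is a one-line comparison. In this Python sketch the checkpoint blocks are represented as plain dictionaries, which is an assumption for illustration only; the lsn and offset values in the example are made-up numbers.

```python
def pick_latest_checkpoint(checkpoint1, checkpoint2):
    """Return the block whose checkpoint_no is larger."""
    return max((checkpoint1, checkpoint2), key=lambda cp: cp["checkpoint_no"])


# Hypothetical contents of the two blocks after several checkpoints.
cp1 = {"checkpoint_no": 6, "checkpoint_lsn": 9948, "checkpoint_offset": 3292}
cp2 = {"checkpoint_no": 5, "checkpoint_lsn": 8916, "checkpoint_offset": 2260}
latest = pick_latest_checkpoint(cp1, cp2)
```

Recovery would then start from latest["checkpoint_lsn"] at latest["checkpoint_offset"].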
8.2 Determining the endpoint of recovery
The starting point of redo log recovery is now determined, but where is the end point? To answer this we have to look at the structure of a block. Redo logs are written sequentially: once one block is filled, writing continues in the next block:
The log block header of an ordinary block has an attribute called LOG_BLOCK_HDR_DATA_LEN, which records how many bytes of the block are in use. For a block that has been filled, this value is always 512. If the value of this attribute is not 512, then this block is the last one that needs to be scanned in this crash recovery.
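The end-point rule can be written as a simple scan. In this Python sketch the input is just the sequence of LOG_BLOCK_HDR_DATA_LEN values read from consecutive blocks, starting at the block that contains checkpoint_lsn; the function name is hypothetical.

```python
def blocks_to_scan(data_lens):
    """Return the indexes of the blocks recovery has to read.

    Scanning stops at (and includes) the first block whose
    LOG_BLOCK_HDR_DATA_LEN is not 512, because that block is only
    partly filled and is therefore the last one written.
    """
    scanned = []
    for index, data_len in enumerate(data_lens):
        scanned.append(index)
        if data_len != 512:
            break
    return scanned


to_read = blocks_to_scan([512, 512, 300, 0])
```

Here to_read is [0, 1, 2]: the third block is partly filled, so the fourth is never examined.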
8.3 How to recover
After determining which redo logs need to be scanned for crash recovery, the next question is how to recover. Suppose there are 5 redo logs in the current redo log file, as shown in the figure:
Since redo 0 lies behind checkpoint_lsn (it is older than the checkpoint), it can be ignored during recovery. We could now scan the redo logs after checkpoint_lsn in order and restore the corresponding pages according to the contents of the logs. That would work, but the designers of InnoDB came up with some ways to speed up the recovery process:
- Use a hash table. A hash value is computed from the space ID and page number attributes of each redo log, and redo logs with the same space ID and page number are put into the same slot of the hash table. If several redo logs share the same space ID and page number, they are linked in a list in the order in which they were generated, as shown in the figure:
After that, the hash table can be traversed. Because all the redo logs that modify the same page are in one slot, a page can be repaired in one go (which avoids a lot of random I/O for reading pages), and that speeds up recovery. Note also that the redo logs for the same page are sorted in the order in which they were generated, so they are applied in that order during recovery. Applying them in any other order could cause errors. For example, if the original operations were to insert a record and then delete it, applying them out of order could mean deleting a record first and then inserting it, which is obviously wrong.
- Skip pages that have already been flushed to disk. As we said before, the dirty pages corresponding to the redo logs before checkpoint_lsn must have been flushed to disk, but we cannot be sure about the dirty pages corresponding to the redo logs after checkpoint_lsn. The main reason is that after the most recent checkpoint, background threads may have gone on flushing some dirty pages out of the Buffer Pool from the LRU list and the flush list. For redo logs after checkpoint_lsn whose dirty pages had already been flushed to disk when the crash occurred, there is no need to modify the page according to the redo log during recovery.
So how do we know during recovery whether the dirty page corresponding to a redo log had already been flushed to disk when the crash occurred? This comes back to the structure of a page. As we said earlier, every page has a part called the File Header, which contains an attribute called FIL_PAGE_LSN that records the lsn value of the page's most recent modification (it is in fact the newest_modification value from the page's control block). If a dirty page was flushed to disk after some checkpoint, then the FIL_PAGE_LSN of that page must be greater than checkpoint_lsn. For any page in this situation, redo logs whose lsn value is less than FIL_PAGE_LSN do not need to be applied again, which further speeds up crash recovery.
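Both optimizations can be combined in a short Python sketch. Redo logs are modelled as dictionaries with space_id, page_no, and lsn keys, and the on-disk FIL_PAGE_LSN values are supplied as a dictionary; these shapes, the function name plan_recovery, and the sample numbers are all assumptions for illustration. Following the rule stated in the text, a log is skipped only when its lsn is less than the page's FIL_PAGE_LSN.

```python
from collections import defaultdict


def plan_recovery(redo_logs, fil_page_lsn):
    """Group redo logs by page and drop those already reflected on disk.

    redo_logs: list of dicts in generation order.
    fil_page_lsn: maps (space_id, page_no) to the page's FIL_PAGE_LSN.
    Returns a dict mapping each page to the lsns that must be applied,
    in generation order.
    """
    # Hash table keyed by (space ID, page number); each slot keeps its
    # logs in the order they were generated.
    by_page = defaultdict(list)
    for rec in redo_logs:
        by_page[(rec["space_id"], rec["page_no"])].append(rec)

    plan = {}
    for key, recs in by_page.items():
        page_lsn = fil_page_lsn.get(key, 0)
        # Skip logs older than what the page on disk already contains.
        to_apply = [r["lsn"] for r in recs if r["lsn"] >= page_lsn]
        if to_apply:
            plan[key] = to_apply
    return plan


logs = [
    {"space_id": 0, "page_no": 5, "lsn": 100},
    {"space_id": 0, "page_no": 7, "lsn": 120},
    {"space_id": 0, "page_no": 5, "lsn": 150},
    {"space_id": 0, "page_no": 7, "lsn": 200},
]
# Page (0, 5) was flushed after lsn 160, so both of its logs are skipped.
plan = plan_recovery(logs, {(0, 5): 160})
```

The resulting plan touches only page (0, 7), applying lsn 120 and then lsn 200 in that order.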