Three major MySQL logs: redo log, undo log, binlog

MySQL's logs fall into several categories: the error log, the general query log, the slow query log, the transaction logs (redo log and undo log), and the binary log (binlog).

binlog

To see why database logs exist, consider a simple example: after pages are loaded from disk into memory, we perform a series of operations on the data. Before the changes are flushed back to disk, we first record them in a log at a known position; only then do we carry out the normal insert, delete, update, and query operations, and eventually flush the pages to disk. If the server restarts before the pages are flushed, we replay the previously recorded log on startup and the data comes back.

The binlog records the write operations performed by the database (not queries) and is stored on disk in binary form. It is MySQL's logical log (you can think of it as recording the SQL statements) and is written by the Server layer, so a MySQL database records binlog regardless of which storage engine it uses.

Uses:

  • Master-slave replication: the master enables binlog and ships its binary log to the slaves, which replay it to keep master and slave data consistent.
  • Data recovery: data can be recovered with the mysqlbinlog tool.
  • Incremental backup.

Viewing:

    1. With the mysqlbinlog tool:  mysqlbinlog mysql-bin.000007

    2. Parsing on the command line:  SHOW BINLOG EVENTS [IN 'log_name'] [FROM pos] [LIMIT [offset,] row_count]

mysql> show binlog events in 'mysql-bin.000007' from 1190 limit 2\G

Formats: STATEMENT, ROW, and MIXED

  • Statement-based replication (SBR): every SQL statement that modifies data is recorded in the binlog.
  • Row-based replication (RBR): instead of recording the context of each SQL statement, it records which rows were modified and how.
  • Mixed-based replication (MBR): a combination of the two. Ordinary statements are logged in statement format, and operations that cannot be replicated safely that way fall back to row format. Selection rule: for INSERT, UPDATE, and DELETE statements that manipulate tables directly, the format follows the binlog_format setting; management statements such as GRANT, REVOKE, and SET PASSWORD are always logged in statement mode.
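
The MIXED selection rule can be sketched in Python. This is a toy model only: the real server checks many "unsafe statement" conditions, and the statement classification below (the ADMIN_STATEMENTS set, the unsafe flag) is an assumption for illustration.

```python
# Management statements are always logged in statement mode, per the rule above.
ADMIN_STATEMENTS = {"GRANT", "REVOKE", "SET PASSWORD"}

def choose_event_format(statement: str, binlog_format: str, unsafe: bool) -> str:
    """Return the format used to log one statement's binlog event (toy model)."""
    words = statement.split()
    verb = words[0].upper()
    first_two = " ".join(words[:2]).upper()
    if verb in ADMIN_STATEMENTS or first_two in ADMIN_STATEMENTS:
        return "STATEMENT"            # GRANT/REVOKE/SET PASSWORD: statement, always
    if binlog_format == "MIXED":
        # MIXED: fall back to ROW only when the statement may replicate unsafely
        return "ROW" if unsafe else "STATEMENT"
    return binlog_format              # STATEMENT or ROW, as configured

print(choose_event_format("GRANT SELECT ON db.* TO u", "ROW", False))      # STATEMENT
print(choose_event_format("INSERT INTO t VALUES (UUID())", "MIXED", True)) # ROW
print(choose_event_format("UPDATE t SET c=1", "MIXED", False))             # STATEMENT
```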

Log content with binlog_format=statement: show binlog events in 'master.000001';

Log content with binlog_format=row:

Row-format events can still be listed with the show command, but that does not reveal the row details; for those you need:

mysqlbinlog  -vv data/master.000001 --start-position=8900

Why does the mixed binlog format exist?

  • Some statement-format binlog events may cause master-slave inconsistency, so row format must be used for them.
  • The drawback of row format is space. For example, a DELETE statement that removes 100,000 rows is recorded in statement format as a single SQL statement occupying a few dozen bytes; in row format, all 100,000 deleted rows must be written to the binlog. That not only takes far more space but also consumes IO resources writing the binlog, slowing execution.
  • MySQL's compromise is the mixed format: MySQL itself judges whether a SQL statement could cause master-slave inconsistency, uses row format if it could, and statement format otherwise.

Why do more and more scenarios now require the binlog format to be set to row?

  • For a DELETE statement, row-format binlog saves the entire image of each deleted row. If you discover you deleted the wrong data, you can convert the DELETE events in the binlog into INSERTs and put the deleted rows back.
  • For a mistaken INSERT statement, the binlog records all field values of the inserted rows, so you can pinpoint exactly what was inserted, convert the INSERT into a DELETE, and remove the erroneous rows.
  • For an UPDATE statement, the binlog records the full row image both before and after the change. To undo a mistaken UPDATE, swap the before- and after-images in the event and replay it against the database.
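
The three inversions can be sketched as follows. This is a toy model of row events, not mysqlbinlog's actual event format; the dict shapes are assumptions for illustration.

```python
def invert_row_event(event: dict) -> dict:
    """Return the compensating row event for a row-format binlog event (toy model)."""
    if event["type"] == "DELETE":      # deleted row image -> re-insert it
        return {"type": "INSERT", "row": event["row"]}
    if event["type"] == "INSERT":      # inserted row image -> delete it
        return {"type": "DELETE", "row": event["row"]}
    if event["type"] == "UPDATE":      # swap the before- and after-images
        return {"type": "UPDATE", "before": event["after"], "after": event["before"]}
    raise ValueError(event["type"])

# To flash back a sequence of mistakes, invert each event and replay in REVERSE order.
mistakes = [{"type": "DELETE", "row": {"id": 2, "c": 0}}]
fixes = [invert_row_event(e) for e in reversed(mistakes)]
print(fixes)  # [{'type': 'INSERT', 'row': {'id': 2, 'c': 0}}]
```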

The standard way to restore data with binlog is to parse it with the mysqlbinlog tool and pipe the entire result to MySQL for execution, similar to the following command:

This parses the content of the master.000001 file from byte 2738 to byte 2973 and sends it to MySQL for execution:
mysqlbinlog master.000001  --start-position=2738 --stop-position=2973 | mysql -h127.0.0.1 -P13000 -u$user -p$pwd;

Binlog writing mechanism:

  • The write logic is straightforward: during transaction execution the log is first written to the binlog cache (write), and when the transaction commits, the binlog cache is flushed to the binlog file (fsync).
  • A transaction's binlog cannot be split, so however large the transaction is, its binlog is written in one go.
  • The system allocates one binlog cache per thread; the binlog_cache_size parameter controls how much memory a single thread's binlog cache may occupy. If a transaction exceeds this size, the excess is temporarily spilled to disk.

Note: write only puts the log into the file system's page cache without persisting it to disk, so it is fast; it is fsync that persists the data. In general, only fsync counts toward disk IOPS.

The fsync timing is controlled by the sync_binlog parameter:

  • sync_binlog=0: only write at each transaction commit, never fsync;
  • sync_binlog=1: fsync at every transaction commit;
  • sync_binlog=N (N>1): write at every transaction commit, but fsync only after N transactions have accumulated.
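
A toy model of the three settings (the counters are illustrative only; with sync_binlog=0, flushing is left entirely to the OS):

```python
class BinlogWriter:
    """Toy model of the sync_binlog policy: write = page cache, fsync = disk."""

    def __init__(self, sync_binlog: int):
        self.sync_binlog = sync_binlog
        self.pending = 0      # commits written to the page cache, not yet fsynced
        self.fsyncs = 0       # how many times the binlog was forced to disk

    def commit(self):
        self.pending += 1     # write: every commit goes to the OS page cache
        if self.sync_binlog >= 1 and self.pending >= self.sync_binlog:
            self.fsyncs += 1  # fsync after sync_binlog commits (1 = every commit)
            self.pending = 0
        # sync_binlog=0: never fsync here; the OS decides when to flush

w = BinlogWriter(sync_binlog=100)
for _ in range(250):
    w.commit()
print(w.fsyncs, w.pending)    # 2 fsyncs; 50 commits still only in the page cache
```

The `pending` counter also shows the risk: if the host crashes, up to sync_binlog-1 committed transactions' binlog can be lost.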

In IO-bound scenarios, setting sync_binlog to a larger value can improve performance. In practice, given the need to bound how much log can be lost, it is generally not recommended to set this parameter to 0; values between 100 and 1000 are common. Since MySQL 5.7.7 the default is 1.

redo log

Origin:

One of the four ACID properties of transactions is durability: once a transaction commits successfully, its changes to the database are saved permanently and cannot be undone for any reason. How, then, does MySQL guarantee durability?

The simplest approach would be to flush all data pages involved in a transaction to disk at every commit, but that causes serious performance problems, mainly in two respects:

  • InnoDB interacts with disk in units of pages, and a transaction may modify only a few bytes within a page; flushing the whole page to disk for that wastes resources.
  • A transaction may modify multiple data pages that are not physically contiguous, so writing them uses random IO, which performs poorly.

If MySQL crashes while data in the buffer pool has not been fully flushed to disk, that data is lost and durability is broken. MySQL therefore designed the redo log: it records only the changes a transaction makes to data pages, so the file is comparatively small and is written with sequential IO.

Basic concepts:

The redo log consists of two parts: the log buffer in memory (redo log buffer) and the log files on disk (redo log file). Each time MySQL executes a DML statement, the record is first written to the redo log buffer, and later multiple records are written to the redo log file in one pass. This write-the-log-first, flush-the-data-later technique is the WAL (Write-Ahead Logging) often mentioned with MySQL.

Redo log writing process:  A. redo log buffer --> B. os buffer --> C. redo log file

Flush timing:

There are three policies for writing the redo log buffer to the redo log file, configured through the innodb_flush_log_at_trx_commit parameter:

0: delayed write, delayed flush. Data is written and flushed to disk roughly once per second; on a crash, up to one second of data may be lost (stuck between A and B in the process above).

1: real-time write, real-time flush. The log is written and flushed to disk at every commit; durable, but with the worst IO performance.

2: real-time write, delayed flush. The log is written to the OS page cache at commit and flushed roughly once per second (between B and C in the process above).
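
A minimal sketch of where a just-committed transaction's redo log sits under each setting. The labels map to stages A/B/C above; this is an illustration of the policy, not InnoDB internals.

```python
def redo_location_after_commit(policy: int) -> str:
    """innodb_flush_log_at_trx_commit: where the redo log sits right after commit."""
    if policy == 0:
        return "redo log buffer"  # stage A: lost if mysqld crashes (up to ~1s of data)
    if policy == 2:
        return "OS page cache"    # stage B: survives a mysqld crash, not a host crash
    if policy == 1:
        return "disk"             # stage C: write + fsync at every commit
    raise ValueError(policy)

for p in (0, 2, 1):
    print(p, "->", redo_location_after_commit(p))
```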

The redo log of an uncommitted transaction can also end up on disk. There are three occasions:

  • InnoDB has a background thread that, once per second, writes the logs in the redo log buffer to the file system's page cache with write, then calls fsync to persist them to disk. Redo logs generated in the middle of transaction execution also go directly into the redo log buffer, so they are swept along.
  • When the redo log buffer reaches about half of innodb_log_buffer_size, the background thread proactively writes it out. Since the transaction has not committed, this is a write without fsync: the log lands only in the file system's page cache.
  • When a parallel transaction commits, it persists this transaction's redo log buffer contents to disk along with its own.

MySQL's "double 1" configuration means that both sync_binlog and innodb_flush_log_at_trx_commit are set to 1. In other words, before a transaction is fully committed it must wait for two flushes to disk: one for the redo log (prepare phase) and one for the binlog.

How it is recorded:

The redo log records changes to data pages, and these change records need not all be kept forever, so the redo log uses a fixed size and is written circularly: when writing reaches the end of the file, it wraps around to the beginning.

The LSN (Log Sequence Number) increases monotonically and marks write points in the redo log: each time a redo record of length length is written, length is added to the LSN. The LSN is also written into InnoDB data pages: the field FIL_PAGE_LSN at the head of each page records the page's LSN, i.e. the LSN at the time the page was last flushed. This ensures a data page is never replayed from the same redo log twice.

The write pos is the LSN of the redo log's current write point; the checkpoint is the LSN up to which data-page changes have been flushed to disk. Both move forward and wrap around, and a record must be flushed to the data files before it can be erased. The part between write pos and checkpoint is the empty part of the redo log, available for new records; the part between checkpoint and write pos holds change records whose pages still need flushing. When write pos catches up with checkpoint, the checkpoint must first be pushed forward, freeing space, before new logs can be written.
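
These mechanics can be modeled with two monotonically increasing LSNs whose gap is bounded by the log's capacity. The sizes and flush granularity below are arbitrary assumptions; this is a sketch, not InnoDB's actual bookkeeping.

```python
class RedoLog:
    """Toy model of the fixed-size circular redo log (write pos vs checkpoint)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.write_pos = 0     # LSN of the latest record written
        self.checkpoint = 0    # LSN up to which dirty pages have been flushed

    def free(self) -> int:
        # free space = capacity minus the gap between write pos and checkpoint
        return self.capacity - (self.write_pos - self.checkpoint)

    def write(self, length: int) -> None:
        while self.free() < length:
            # write pos caught up with checkpoint: flush pages, push checkpoint
            self.advance_checkpoint(length)
        self.write_pos += length

    def advance_checkpoint(self, amount: int) -> None:
        # flushing dirty pages up to an LSN frees the log space behind it
        self.checkpoint = min(self.write_pos, self.checkpoint + amount)

log = RedoLog(capacity=100)
for _ in range(30):
    log.write(10)              # 300 bytes of redo pass through a 100-byte file
print(log.write_pos, log.write_pos - log.checkpoint)
```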

crash-safe:

When InnoDB starts, it always performs a recovery pass, whether the previous shutdown was clean or not. Because the redo log records physical changes to data pages, this recovery is much faster than replaying a logical log such as the binlog. On restart, InnoDB first checks the LSN of each data page on disk; if a page's LSN is less than the LSN in the log, recovery starts from the checkpoint.

There is also the case where a checkpoint flush was in progress before the crash and the data pages' flush progress exceeded that of the log. The LSN recorded in those pages is then greater than the LSN in the log; the part beyond the log's progress is not redone, since it already reflects work that was done.

Two-phase commit        

Consider the execution flow of an update statement (the figure from the original post is not reproduced here):

The redo log write is split into two steps, prepare and commit. This is "two-phase commit", and its purpose is to keep the two logs logically consistent.

Group commit:

Suppose multiple concurrent transactions are all in the prepare phase. The first transaction to write is chosen as the leader of the group; by the time it actually writes to disk, the group may already hold three transactions, the group's LSN has become the LSN of the last transaction in it, and all three transactions are flushed to disk together.

In concurrent-update scenarios, the later the leader calls fsync after finishing its redo log buffer write, the more group members there may be and the more IOPS are saved. MySQL has an optimization mechanism here:

As mentioned before, writing the binlog is split into two steps, write and fsync. To improve the group-commit effect, MySQL delays the redo log's fsync until after the binlog write.

So the two-phase commit becomes the following sequence (the original figure is not reproduced here):

    1. redo log prepare: write
    2. binlog: write
    3. redo log prepare: fsync
    4. binlog: fsync
    5. redo log commit: write

This way the binlog can also be group-committed: when the binlog is fsynced to disk in step 4, if several transactions' binlogs have already been written, they are persisted together, which also reduces IOPS. However, step 3 usually executes very quickly, so the interval between the binlog's write and fsync is short, fewer binlogs can be accumulated for one persistence, and binlog group commit is usually less effective than redo log group commit.

To improve the effect of binlog group commit, you can set binlog_group_commit_sync_delay and binlog_group_commit_sync_no_delay_count:

  • binlog_group_commit_sync_delay: how many microseconds to wait before calling fsync;
  • binlog_group_commit_sync_no_delay_count: how many commits to accumulate before calling fsync.

These two conditions are ORed together: fsync is called as soon as either is met. Consequently, when binlog_group_commit_sync_delay is set to 0, binlog_group_commit_sync_no_delay_count has no effect either.
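
The OR relationship can be sketched as a single predicate. The parameter handling below is a toy model of the documented behavior, not the server's actual scheduler.

```python
def should_fsync(waited_us: int, queued: int,
                 delay_us: int, no_delay_count: int) -> bool:
    """fsync fires when EITHER the configured delay elapses OR enough commits queue up."""
    if delay_us == 0:
        # binlog_group_commit_sync_delay = 0: fsync immediately;
        # binlog_group_commit_sync_no_delay_count never comes into play
        return True
    return waited_us >= delay_us or queued >= no_delay_count

print(should_fsync(0, 1, 0, 100))       # True  (delay=0 -> count is moot)
print(should_fsync(50, 10, 1000, 10))   # True  (count reached first)
print(should_fsync(1200, 3, 1000, 10))  # True  (delay elapsed first)
print(should_fsync(500, 3, 1000, 10))   # False (neither condition met yet)
```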

At this point someone may ask: the WAL mechanism is supposed to reduce disk writes, yet redo log and binlog are both written on every transaction commit, so hasn't the number of disk writes stayed the same? The WAL mechanism mainly wins on two fronts:

  • Both redo log and binlog are written sequentially, and sequential disk writes are faster than random writes;
  • The group-commit mechanism can greatly reduce disk IOPS consumption.

So what is the point of keeping the two logs logically consistent? In short, when you mis-operate on the database, or add read capacity by scaling out, this consistency ensures the data can be restored to the state before the mistake, and that master and slave stay consistent.

The binlog records all logical operations and is written in append-only form. If your DBA promises that data within the last half month can be restored, then the backup system keeps all binlogs from the last half month, together with regular full backups of the whole database. How "regular" depends on the system's importance: it may be daily or weekly.

When you need to restore to a specified second, say a table was mistakenly dropped at 12 noon and the mistake was discovered at 2 p.m., you can do this:

First, find the most recent full backup. If you are lucky, it may be a backup from last night and restore from this backup to the temporary library;

Then, starting from the time of that backup, take the backed-up binlogs and replay them in order up to the moment just before the table was mistakenly dropped.

You can refer to these two steps: https://zhuanlan.zhihu.com/p/33504555

How does two-phase commit guarantee consistency? Put differently: without two-phase commit, can the data still be guaranteed consistent?

Using the flowchart example again: suppose the row with ID=2 currently has c=0, and suppose that during execution of the update statement a crash occurs after the first log is written but before the second. What happens?

1. Write the redo log first, then the binlog. Once the redo log is written, the data can be recovered even if the system crashes, so after recovery the row's value of c is 1. But the crash happened before the binlog was finished, so this statement was never recorded in the binlog. When the binlog is later archived, this statement is missing from it; if you then need the binlog to restore a temporary instance, this update is lost and the restored row has c=0, different from the original database.

2. Write the binlog first, then the redo log. If the crash occurs right after the binlog is written, the redo log was never written, so the transaction is invalid after crash recovery and the row's value of c is 0. But the binlog already recorded "change c from 0 to 1", so restoring from the binlog later produces one extra transaction and the restored row has c=1, different from the original database.

Now, what happens when MySQL restarts abnormally at different moments within the two-phase commit?

If the crash occurs after the redo log is written in the prepare phase but before the binlog is written, then since the binlog does not exist yet and the redo log has not committed, the transaction is rolled back during crash recovery. The binlog was never written, so nothing is sent to the standby either.

If the crash occurs after the binlog is written but before the redo log commits, what does MySQL do on recovery? If the redo log holds only a complete prepare record for the transaction, MySQL checks whether the corresponding binlog event exists and is complete; if so, the transaction is committed, otherwise it is rolled back.
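
The recovery rule for the two crash points can be sketched as a small decision function. The state labels are simplified names, not InnoDB's actual on-disk flags.

```python
def recovery_decision(redo_state: str, binlog_complete: bool) -> str:
    """Crash-recovery decision for one transaction under two-phase commit (sketch)."""
    if redo_state == "commit":
        return "commit"        # redo log already carries a commit record
    if redo_state == "prepare":
        # a complete binlog means the change may already have reached the
        # standby via replication, so the transaction must be committed
        return "commit" if binlog_complete else "rollback"
    return "rollback"          # no prepared redo: the transaction never took effect

print(recovery_decision("prepare", False))  # rollback: crashed before binlog write
print(recovery_decision("prepare", True))   # commit: crashed between binlog and commit
```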

A few questions:

1. If there were only one log, there would be no need for two-phase commit. Couldn't the binlog alone support both crash recovery and archiving?

Historically, InnoDB is not MySQL's native storage engine; the native engine is MyISAM, which was not designed to support crash recovery. Before InnoDB joined the MySQL engine family as a plug-in, it was already an engine providing crash recovery and transaction support.

In terms of implementation, the reason is that the binlog has no crash-safe capability.

2. Conversely, could the redo log alone do the job?

No, for two reasons: the redo log cannot archive, because it is written circularly; and the MySQL ecosystem depends on the binlog, which is the foundation of MySQL high availability (replication).

3. In a normally running instance, when data is finally written to disk, does it come from the redo log or from the buffer pool?

In fact, the redo log does not record the complete contents of a data page, so it cannot update disk pages by itself; there is no such thing as "data is finally flushed to disk by the redo log". Dirty pages are flushed from the buffer pool.

4. Why is the binlog cache maintained per thread, while the redo log buffer is globally shared?

The main reason is that the binlog cannot be interrupted: a transaction's binlog must be written as one contiguous unit, so it is written to the file in one piece only after the whole transaction completes.

The redo log has no such requirement: log records generated mid-transaction can be written into the shared redo log buffer as they are produced, and its contents can even "hitch a ride" to disk when other transactions commit.

5. If a crash occurs during transaction execution, before the commit phase is reached, the redo log is certainly lost. Will this cause master-slave inconsistency?

No. At that point the binlog is still in the binlog cache and was never sent to the standby. After the crash, both the redo log and the binlog are gone; from the business point of view the transaction was never committed, so the data remains consistent.

6. If a crash occurs right after the binlog is written, the server restarts without replying to the client, and on reconnecting the client finds the transaction was committed successfully. Is that a bug?

It is not. Consider an even more extreme case: the whole transaction commits successfully, the redo log commit completes, and the standby has received and applied the binlog, but the network between the primary and the client drops, so the success packet never arrives. The client still sees a "network disconnected" exception. This must be counted as a successful transaction, not a bug.

In fact, the crash-safe guarantee of the database is:

If the client receives a message that the transaction is successful, the transaction must be persistent;

If the client receives a transaction failure (such as primary key conflict, rollback, etc.), the transaction must have failed;

If the client receives an "execution exception" message, the application must reconnect and determine the current state by querying before continuing. In that case the database only needs to guarantee internal consistency (between data and logs, and between primary and standby).

undo log

One of the four ACID properties of database transactions is atomicity: a series of operations on the database either all succeed or all fail; partial success is impossible.

Atomicity is implemented underneath via the undo log. The undo log records the logical inverse of each data change: an INSERT statement corresponds to a DELETE undo record, and each UPDATE corresponds to an opposite UPDATE undo record, so that when an error occurs the data can be rolled back to its pre-transaction state. The undo log is also the key to MVCC (multi-version concurrency control).

  • insert undo log: generated by INSERT operations. Since an inserted record is new, it is invisible to other transactions, so this undo log can be deleted directly when the transaction commits.
  • update undo log: generated by DELETE and UPDATE operations. Because of MVCC it cannot be deleted at commit; it is placed on the undo log linked list and awaits final removal by the purge thread.
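
The "logical inverse" idea can be sketched like this. The record shapes are assumptions for illustration; real undo records are page-level structures, not SQL-like dicts.

```python
def make_undo(op: str, before=None, after=None) -> dict:
    """Build the logical inverse of a data change, as a toy undo record."""
    if op == "INSERT":    # inverse of INSERT is a DELETE of the new row
        return {"op": "DELETE", "row": after}
    if op == "DELETE":    # inverse of DELETE restores the old row
        return {"op": "INSERT", "row": before}
    if op == "UPDATE":    # inverse UPDATE writes the old values back
        return {"op": "UPDATE", "set": before, "where": after}
    raise ValueError(op)

print(make_undo("UPDATE", before={"c": 0}, after={"c": 1}))
print(make_undo("INSERT", after={"id": 7}))
```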

Storage location:

The InnoDB storage engine manages undo logs in segments; the segment used for this is called the rollback segment, and each rollback segment contains 1024 undo log segments. Earlier versions supported only one rollback segment, limiting it to 1024 undo operations. Since MySQL 5.5 there can be 128 rollback segments, i.e. 128*1024 undo operations. The number of rollback segments can be customized through the innodb_undo_logs variable (named innodb_rollback_segments before 5.6); the default is 128.

By default the rollback segments live in the shared tablespace, i.e. the ibdata file, but independent undo tablespaces can be configured. When write pressure on the database is high, you can set up independent undo tablespaces (their number must be specified when the instance is initialized), separating the undo log from the ibdata file, and point innodb_undo_directory at their location, for example a fast disk, to speed up undo log reads and writes.

How undo and redo record transactions

Suppose there are two values, A and B, equal to 1 and 2 respectively, and a transaction starts whose job is to change 1 to 3 and 2 to 4. The actual (simplified) record sequence is:

  A. Transaction start.
  B. Record A=1 to undo log.
  C. Modify A=3.
  D. Record A=3 to redo log.
  E. Record B=2 to undo log.
  F. Modify B=4.
  G. Record B=4 to redo log.
  H. Write redo log to disk.

  I. Transaction commit.
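
The A-I sequence above can be sketched directly: each modification first appends the before-image to the undo log, applies the change in memory, then appends the after-image to the redo log buffer; only step H touches disk. All structures here are toy ones.

```python
data = {"A": 1, "B": 2}        # in-memory pages
undo_log = []                  # before-images
redo_buffer = []               # after-images, buffered in memory
disk_redo = []                 # what has actually reached disk

def modify(key, new_value):
    undo_log.append((key, data[key]))     # steps B/E: old value -> undo log
    data[key] = new_value                 # steps C/F: change the in-memory page
    redo_buffer.append((key, new_value))  # steps D/G: new value -> redo log buffer

# A. transaction start
modify("A", 3)
modify("B", 4)
disk_redo.extend(redo_buffer)  # H. write redo log to disk (the only IO so far)
# I. transaction commit
print(data, undo_log, disk_redo)
```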

The Undo + Redo design mainly aims to improve IO performance and increase database throughput. Note that steps B, D, E, G, and H all append log records, but B, D, E, and G only write to in-memory buffers; only step H performs an actual IO operation. To ensure the redo log achieves good IO performance, InnoDB's redo log has the following features:

  1. The redo log is kept in contiguous space as far as possible: the log file space is fully allocated when the system first starts, and records are appended sequentially, gaining performance from sequential IO.
  2. Logs are written in batches: they go to the redo log buffer first, and when the log must be flushed (e.g. at transaction commit), many records are written to disk together.
  3. Concurrent transactions share the redo log's storage space; their records are interleaved in statement-execution order to reduce the space the log occupies. A side effect is that logs of uncommitted transactions also get written to disk.
  4. The redo log is append-only: when a transaction is rolled back, its redo records are not deleted from the log.

How to recover?

As noted above, uncommitted and rolled-back transactions also end up in the redo log, so recovery must handle them specially.

Given the redo log's characteristics, it is impossible to redo only the committed transactions. What can be done is to redo all transactions, including uncommitted and rolled-back ones, and then use the undo log to roll back those that did not commit.
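
The recovery order just described, replay all redo records ignoring transaction boundaries, then undo the uncommitted transactions, can be sketched as follows. The log record formats are assumptions for illustration.

```python
def recover_pages(pages: dict, redo: list, undo: list, committed: set) -> dict:
    """Toy recovery: redo everything first, then undo uncommitted transactions."""
    for txn, key, new_value in redo:
        pages[key] = new_value            # 1) replay ALL redo, committed or not
    for txn, key, old_value in reversed(undo):
        if txn not in committed:
            pages[key] = old_value        # 2) roll back via undo, newest first
    return pages

pages = {"A": 1, "B": 2}
redo = [("t1", "A", 3), ("t2", "B", 4)]   # t2 crashed before committing
undo = [("t1", "A", 1), ("t2", "B", 2)]
print(recover_pages(pages, redo, undo, committed={"t1"}))  # {'A': 3, 'B': 2}
```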

InnoDB's recovery mechanism has several characteristics:

Replaying the redo log ignores transactionality: during recovery there is no BEGIN, COMMIT, or ROLLBACK behavior, and no attention is paid to which transaction each record belongs to. Transaction-related content such as the transaction ID is recorded in the redo log, but it is treated merely as part of the data being operated on.

The undo log must itself be durable: logically, the corresponding undo record should reach disk before its redo log, and this coupling between undo and redo makes persistence complicated. To reduce complexity, InnoDB treats the undo log as data: the operation of writing an undo record is itself recorded in the redo log. The undo log can then be cached like ordinary data instead of having to reach disk before the redo log.

Since redo replay is not transactional, it re-executes even transactions that were rolled back. Meanwhile, InnoDB also records rollback operations in the redo log: a rollback is in essence a modification of the data, so the data operations performed during rollback are written to the redo log too. A rolled-back transaction is therefore redone first and then undone during recovery, so data consistency is not broken.

Source references: https://zhuanlan.zhihu.com/p/190886874 , https://www.cnblogs.com/wyy123/p/7880077.html , https://www.cnblogs.com/drizzle-xu/p/9713513.html

and Lin Xiaobin's "MySQL in Action: 45 Lectures" (MySQL实战45讲).


Origin blog.csdn.net/qq_24436765/article/details/110493416