Understanding how databases use logs for fault repair

1. Introduction

Logs matter in databases and in other business systems alike. They usually serve the following purposes:

1. Locating business problems. Everyone who builds systems has written a few bugs. With logs, a problem can be pinpointed quickly and the system repaired. This is the most common use of logs.

2. Monitoring the system's operation. Every operation leaves a trace, so the log lets you verify that the system follows the intended process rather than the process you assume it follows. After all, a computer runs the way it was actually programmed, not the way you think it was.

3. Security audit. In the era of the paperless office, anyone who does something without erasing the records and assumes that all is safe is burying their head in the sand like an ostrich.

4. Fault repair. This is standard equipment for a database. MySQL, Redis, MongoDB and ES (Elasticsearch) all rely on logs to help repair the system.

The focus of this article is how the database uses logs to repair faults.

 

 

2. System failure

The book "Concepts and Technology of Transaction Processing" catalogs the sources of failure: environment, operations, maintenance, hardware, software, process, and so on.

As users, we want the system to work normally again once a fault has been repaired. In other words, we want the system to tolerate some faults (fault tolerance) so that it stays in service longer. A 996 schedule is not enough; 7*24 is the goal. In an era of automated business processes, systems handle work at enormous rates (for example, during Tmall's Double 11, Alipay's OceanBase handled a peak of 61 million operations per second), so downtime is very expensive. In e-commerce and finance, every transaction processed is real money. When the system goes on strike, the loss is far greater than when one person takes leave or resigns.

Failures are rare, but they can be fatal, so they must be guarded against.

 

3. Countermeasures

 

In "Concepts and Technology of Transaction Processing", the definition of a transaction is very simple: a change of database state. For example, when my salary is paid, the balance on my bank card increases. That balance is one piece of the card's state. The other side of the payroll process is that the company's account has less money.

The database needs to ensure that the status changes of the subjects involved in the business process are consistent, otherwise the company's accounts will be messed up.

This leads to the properties of a transaction: ACID. ACID works much like contract law. While everything goes smoothly, the contract sits at the bottom of a drawer; when a dispute arises, the contract is what settles it. ACID is easier to understand through a real-life example, marriage. To be a little more rigorous, take China's "Marriage Law" as the reference.

Atomicity-A: At the start, both people are [unmarried]. The moment they receive the marriage certificate, both become [married]. The two states are tied together and change together or not at all.

Consistency-C: Both people go from [unmarried] to [married]. A consistency constraint is implied here: a man can have only one current wife, and a woman can have only one current husband. This is the same as a unique index constraint in a database.

Isolation-I: One couple's marriage registration has nothing to do with anyone else and does not affect other couples who are registering.

Durability-D: Unless the couple divorces, the married status persists.


The isolation of the transaction is achieved by the lock mechanism, and the atomicity, consistency and durability are guaranteed by the transaction log.

 

With the basic concepts in place, let's do a thought experiment. If a business process is interrupted by a code bug (for example, division by 0), how do we keep the database's consistency constraints intact? The ultimate goal of exception handling is to keep the state in the database consistent, so that the system still satisfies the business rules.
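
To make the thought experiment concrete, here is a minimal sketch using Python's standard sqlite3 module. SQLite is not one of the databases discussed in this article; it is used only because it is easy to run. The table name `account`, the owners, and the helpers `pay_salary` and `balances` are hypothetical names chosen for illustration.

```python
import sqlite3


def pay_salary(conn, amount, buggy=False):
    """Move `amount` from the company account to Zhang San's account."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if the block raises an exception.
    with conn:
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE owner = 'company'",
            (amount,),
        )
        if buggy:
            1 // 0  # simulated bug between the two updates
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE owner = 'zhang_san'",
            (amount,),
        )


def balances(conn):
    """Return {owner: balance} for every account."""
    return dict(conn.execute("SELECT owner, balance FROM account"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (owner TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?)", [("company", 1000), ("zhang_san", 0)]
)
conn.commit()

try:
    pay_salary(conn, 100, buggy=True)  # the bug fires after the first update
except ZeroDivisionError:
    pass

after_failure = balances(conn)
print(after_failure)
```

After the failed call, both balances are back at their starting values: the company did not lose money that Zhang San never received.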

 

Undo log

If a state change goes wrong, restoring the previous state fixes it, much like practicing at a driving school, where you back up and try again over and over. The undo log records the state before a transaction's change. If the transaction fails, we roll back to the original state, as if nothing had happened. Let T denote the transaction, X the element changed by the transaction, and v the original value of X. Then <T,X,v> are the three required fields of a log record. For example, <Pay Salary, Zhang San, 0> means that Zhang San, who spends every paycheck before the next one arrives, had no money on his bank card before the salary was paid.

 

While processing a transaction, the system also writes the following marker records to the log:

The log record <START T> marks the beginning of a transaction.

The log record <COMMIT T> marks the commit of a transaction, and <ABORT T> marks its abort. For any one transaction, exactly one of <COMMIT T> and <ABORT T> is recorded.

 

The processing flow of a transaction is as follows:

S1: Record <START T> to mark the start of the transaction.

S2: Record the undo log entry <T,X,v>, the state of the element before the transaction changes it. (Requires persistence)

S3: Change element X to its new state. (Requires persistence)

S4: Record <COMMIT T> to mark the transaction as committed.

 

When the system fails, it suspends service and starts the fault repair procedure:

S1: Traverse the log in reverse. Transactions that already have <COMMIT T> need no processing.

S2: For transactions without <COMMIT T>, use the undo log to restore each changed element to its state before the change.

Note that each operation of fault repair constitutes a new transaction.
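
The transaction flow and the repair procedure above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any real database implements it: the dict `db` stands in for the data on disk, the list `log` stands in for the persisted log, and the function names and the tuple layout of the records are hypothetical. Writing <ABORT T> after a rollback is also an assumption of this sketch.

```python
def run_transaction(db, log, txn, updates, crash_before_commit=False):
    """Apply `updates` ({element: new value}) under undo logging."""
    log.append(("START", txn))                     # S1
    for key, new_value in updates.items():
        log.append(("UPDATE", txn, key, db[key]))  # S2: old value is logged first
        db[key] = new_value                        # S3: then the element changes
    if crash_before_commit:
        return                                     # simulated crash: no COMMIT record
    log.append(("COMMIT", txn))                    # S4


def undo_recover(db, log):
    """Scan the log backwards and restore old values of unfinished transactions."""
    finished = set()  # transactions that already have COMMIT or ABORT
    undone = []       # unfinished transactions, in the order they are met
    for record in reversed(log):
        kind, txn = record[0], record[1]
        if kind in ("COMMIT", "ABORT"):
            finished.add(txn)
        elif kind == "UPDATE" and txn not in finished:
            _, _, key, old_value = record
            db[key] = old_value                    # restore the value before the change
            if txn not in undone:
                undone.append(txn)
    for txn in undone:
        log.append(("ABORT", txn))                 # a later recovery will skip them
    return undone


db = {"company": 1000, "zhang_san": 0}
log = []
run_transaction(db, log, "T1", {"company": 900, "zhang_san": 100})
run_transaction(db, log, "T2", {"company": 800, "zhang_san": 200},
                crash_before_commit=True)
rolled_back = undo_recover(db, log)
print(rolled_back, db)  # ['T2'] {'company': 900, 'zhang_san': 100}
```

T1 committed, so its changes stay. T2 never reached <COMMIT T>, so both of its changes are rolled back.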

 

The undo log repair process leads to two concepts: checkpoint and idempotence.

Checkpoint: Because system failures are rare, a huge volume of log accumulates between failures. Traversing the entire log history would make fault repair far too slow. The checkpoint mechanism therefore records milestones at which the system was known to be fault-free. After a failure, the repair process only needs to scan back to the most recent milestone and can stop there.
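
As a rough illustration, the sketch below stops the backward scan at a checkpoint record. It assumes the simplest kind of checkpoint, one written only when no transaction is in progress and all earlier changes are already on disk. The record name `CKPT`, the tuple layout, and the function name are hypothetical.

```python
def undo_recover(db, log):
    """Undo unfinished transactions, scanning back only to the last checkpoint."""
    finished = set()
    scanned = 0
    for kind, txn, *fields in reversed(log):
        if kind == "CKPT":
            break                      # everything older is already settled
        scanned += 1
        if kind in ("COMMIT", "ABORT"):
            finished.add(txn)
        elif kind == "UPDATE" and txn not in finished:
            key, old_value = fields
            db[key] = old_value        # restore the value before the change
    return scanned


log = [
    ("START", "T1"), ("UPDATE", "T1", "x", 0), ("COMMIT", "T1"),
    ("CKPT", None),
    ("START", "T2"), ("UPDATE", "T2", "x", 1),  # T2 crashed before COMMIT
]
db = {"x": 2}                                   # the value T2 left on disk
records_scanned = undo_recover(db, log)
print(records_scanned, db)  # 2 {'x': 1}
```

Only the two records after the checkpoint are read; the three records of T1 are never touched.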

Idempotence: What if the repair itself fails partway through? The answer is to make every log repair operation idempotent, so that simply retrying solves the problem. An operation is idempotent if executing it once gives the same result as executing it several times. "Concepts and Technology of Transaction Processing" gives a vivid example:

"Moving the control rod of a nuclear reactor down by 2cm" is non-idempotent.

"Moving the control rod of a nuclear reactor to the xx position" is idempotent.
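
The difference is easy to see in code. In this small sketch the function names and the numbers are made up for illustration: a relative move gives a different result each time it is repeated, while an absolute move does not. Restoring an element to its logged old value is an operation of the second kind.

```python
def move_down(position_cm, delta_cm):
    """Relative move: the result depends on how many times it is applied."""
    return position_cm - delta_cm


def move_to(position_cm, target_cm):
    """Absolute move: the result is the same however often it is applied."""
    return target_cm


start = 50
once_relative = move_down(start, 2)
twice_relative = move_down(once_relative, 2)
once_absolute = move_to(start, 30)
twice_absolute = move_to(once_absolute, 30)

print(once_relative, twice_relative)  # 48 46
print(once_absolute, twice_absolute)  # 30 30
```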

 

Redo log

 

Undo logging has a performance cost: a transaction cannot commit until the data it changed has been written to disk. Every transaction therefore needs at least 3 disk operations (the undo record, the changed data, and the commit record). Fixing one problem has surfaced another, like pushing down a gourd only to see the ladle float up. A new mechanism is needed, and this is where the redo log comes from.

 

If the undo log represents a pessimistic attitude toward life, the redo log represents an optimistic one, in the spirit of the old verse about not returning home until Loulan is conquered. If the system fails, we have already recorded the end state we are heading for, so we simply retry until we get there. Why turn back?

As with the undo log, a redo log record has the form <T,X,v>. Here it means that transaction T changes the value of element X to v, where v is the new value.

The transaction flow using redo log is as follows:

S1: Record <START T> to mark the start of the transaction.

S2: Record the redo log entry <T,X,v>, the state of the element after the transaction changes it. (Requires persistence)

S3: Record <COMMIT T> to mark the transaction as committed. (Requires persistence)

S4: Change element X to its new state.

 

Compared with the undo log, <COMMIT T> is written earlier, before the data is changed. Whether a transaction completed is still decided by looking for its COMMIT record in the log, but a COMMIT record no longer guarantees that the transaction's data has been persisted to disk. This in turn requires that, when a checkpoint is generated, the data of committed transactions has already been written to disk.

The fault repair process is also quite simple:

S1: Find all transactions that have committed.

S2: Use the redo log to re-execute the operations of those transactions.
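
The redo flow and its repair procedure can be sketched the same way. As before, this is a toy model under stated assumptions: the dict `db` stands in for the data on disk, the list `log` for the persisted log, and the function names, the `crash_at` parameter and the record layout are hypothetical.

```python
def run_transaction(db, log, txn, updates, crash_at=None):
    """Apply `updates` ({element: new value}) under redo logging."""
    log.append(("START", txn))                       # S1
    for key, new_value in updates.items():
        log.append(("UPDATE", txn, key, new_value))  # S2: new value goes to the log
    if crash_at == "before_commit":
        return                                       # nothing was changed on disk
    log.append(("COMMIT", txn))                      # S3
    if crash_at == "after_commit":
        return                                       # committed, data not yet written
    db.update(updates)                               # S4


def redo_recover(db, log):
    """Replay, in log order, the updates of every committed transaction."""
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}
    for rec in log:
        if rec[0] == "UPDATE" and rec[1] in committed:
            _, _, key, new_value = rec
            db[key] = new_value


db = {"company": 1000, "zhang_san": 0}
log = []
run_transaction(db, log, "T1", {"company": 900, "zhang_san": 100})
run_transaction(db, log, "T2", {"company": 800, "zhang_san": 200},
                crash_at="after_commit")
run_transaction(db, log, "T3", {"company": 700, "zhang_san": 300},
                crash_at="before_commit")
before_recovery = dict(db)
redo_recover(db, log)
print(before_recovery, db)
```

Before recovery the data still shows only T1. After recovery it also shows T2, which had committed but had not yet reached the disk, while T3, which never committed, is ignored.
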

Undo/Redo log

Redo logging requires that all modified blocks stay in the buffer until the transaction has committed and its log records have been flushed, which may increase the average number of buffers a transaction needs. Moreover, if a database element is not a complete block, the undo log and the redo log place conflicting demands on how the buffer is handled during a checkpoint. Hence a combined scheme, the undo/redo log, which keeps both options open. Each log record now carries 4 values, <T,X,v,w>: transaction T changes element X, the old value of X is v, and the new value is w.

The transaction flow using undo/redo logs is as follows:

S1: Record <START T> to mark the start of the transaction.

S2: Record the undo/redo log entry <T,X,v,w>: transaction T changes element X, the old value of X is v, and the new value is w. (Requires persistence)

S3: Modify the database elements.

S4: Record <COMMIT T>.

S3 and S4 may happen in either order.

The system can choose the order flexibly according to the current state of the buffer. Whether the database elements are modified before or after <COMMIT T> no longer matters.

 

Recovery:

S1: Traverse the log.

S2: For transactions that have committed, use the redo information to redo them.

S3: For transactions that have not committed, use the undo information to roll them back.
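
A minimal sketch of this recovery, again as a toy model: `db` stands in for the data on disk, `log` for the persisted log, and the function name and record layout are hypothetical. The sketch assumes the common textbook order, redoing committed transactions from the earliest record forward and undoing uncommitted ones from the latest record backward.

```python
def undo_redo_recover(db, log):
    """Redo committed transactions, then undo the uncommitted ones."""
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}

    # Redo: walk the log forward and apply the new value w.
    for rec in log:
        if rec[0] == "UPDATE" and rec[1] in committed:
            _, _, key, _old, new = rec
            db[key] = new

    # Undo: walk the log backward and restore the old value v.
    for rec in reversed(log):
        if rec[0] == "UPDATE" and rec[1] not in committed:
            _, _, key, old, _new = rec
            db[key] = old


log = [
    ("START", "T1"), ("UPDATE", "T1", "x", 0, 1), ("COMMIT", "T1"),
    ("START", "T2"), ("UPDATE", "T2", "y", 10, 20),  # T2 never committed
]
# State on disk at the crash: T1's change had not been written yet,
# while T2's change had already been written.
db = {"x": 0, "y": 20}
undo_redo_recover(db, log)
print(db)  # {'x': 1, 'y': 10}
```

Because each record holds both values, recovery works whether or not the data reached the disk before the crash, which is exactly the flexibility between S3 and S4 described above.
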

Follow-up

This article briefly summarizes transactions and ACID. Starting from the underlying ideas, it traces how the undo, redo, and undo/redo mechanisms evolved, and where the two concepts of checkpoint and idempotence come from. Follow-up articles will analyze the log implementations of MySQL, ES and other databases, to see how these ideas are put into practice in production systems.

 

 

References

"Concepts and Technology of Transaction Processing"

"Database System Implementation"




Origin blog.51cto.com/sbp810050504/2667394