Comparison of Common Technologies and Schemes of the Database Recovery Subsystem (1)

Author: Chen / Big Data Laboratory

For transactional databases, the most critical function is to ensure the ACID properties of transactions, of which atomicity and durability are guaranteed by the recovery subsystem. If a transaction in progress is found unable to continue, it must be rolled back by the recovery subsystem; if a system crash occurs, the database must be restored to the state before the crash. In this column, we mainly introduce logging protocols and recovery algorithms, the two key parts of the transactional database recovery subsystem.

— Logging Schemes—

The key to the recovery subsystem is the recovery algorithm, which involves two processes. The first is preparing for recovery while transactions execute. Most systems do this with logging: although recording data updates in the log adds overhead during transaction execution, without a log the system cannot recover after a crash or roll back incomplete transactions. An alternative is Shadow Paging, in which every modification is performed Copy-on-Write: to update data, the system makes a copy of the original data, updates the copy, and completes the operation by replacing the original with the copy. Shadow Paging is expensive and is generally used in infrequently updated scenarios, such as text editors, so transactional database systems mostly use log-based schemes. The second process is using the recorded log information, with an appropriate strategy, to restore the database to a correct state after a system failure or transaction rollback.

  • Physical Logging & Logical Logging

Logging falls into two categories: Physical Logging and Logical Logging. Physical Logging records the modification of data items in each log record. For example, if the value of data item A is 90 before modification and 100 after, Physical Logging records this change of data item A. In a database system, Physical Logging may be Value Logging, which records the data item, its ID, and the attribute values before and after modification; it may also be truly physical logging, which records the Page ID, offset, and length on the disk page together with the values before and after modification.

The other type is Logical Logging, which does not record execution results but only the operations that modify data, such as delete and update. Whereas Physical Logging recovers or replays based on the values before and after modification, Logical Logging must re-execute the logged operation during replay, and must perform the inverse operation during rollback, such as undoing an insert with the corresponding delete.
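To make the contrast concrete, here is a minimal sketch (in Python, with invented field names) of how the same update, changing A from 90 to 100, might appear under each scheme:

```python
# Hypothetical sketch contrasting the two logging styles for the same
# update "set A = 100" (old value 90).

# Physical (value) logging records the data item and its before/after images:
physical_record = {
    "txn_id": 1,
    "page_id": 5,     # disk page holding item A
    "offset": 128,    # byte offset within the page
    "length": 4,
    "before": 90,     # needed for undo
    "after": 100,     # needed for redo
}

# Logical logging records only the operation, not its result:
logical_record = {
    "txn_id": 1,
    "operation": "UPDATE items SET value = value + 10 WHERE key = 'A'",
}

def replay_physical(page, rec):
    """Redo by reinstalling the after-image at the recorded location."""
    page[rec["offset"]] = rec["after"]

def undo_physical(page, rec):
    """Rollback by reinstalling the before-image."""
    page[rec["offset"]] = rec["before"]

page = {128: 90}
replay_physical(page, physical_record)
assert page[128] == 100
undo_physical(page, physical_record)
assert page[128] == 90
```

Note that replaying `logical_record` would require re-executing the statement through the query engine, which is exactly why its replay order must match the original schedule.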

  • Physical Logging VS Logical Logging

Both types of logging have advantages and disadvantages. Logical Logging records less content: for an update operation it only needs to record one update statement, so logging overhead is low. The disadvantage is that it is difficult to implement under concurrency. When multiple transactions generate update operations at the same time, the database schedules these operations into a serialized sequence for execution, and a mechanism is needed to guarantee that replay executes operations in the same order as the original schedule. Therefore, most database systems use Physical Logging to ensure consistent recovery: the execution order of transaction operations produced by the transaction manager (the concurrency control subsystem) is recorded in the log, and the recovery subsystem can guarantee, according to log order, that the rollback and replay of each data item modification execute strictly in sequence. Some database systems still use Logical Logging, however, such as the in-memory database engine VoltDB. This is because the VoltDB engine is designed without concurrency control: each CPU core executes all its operations in sequence, so replay can proceed in order from a logical log.
For a database management system, the recovery subsystem is indispensable for guaranteeing durability and correctness in the event of a failure, and it allows a transaction to be rolled back when it must be undone, ensuring atomicity. At the same time, the recovery subsystem affects performance: even after all of a transaction's operations have completed, the transaction is not truly committed until all of its log records have been flushed to disk, and the system must wait for the log to reach disk before responding to the client. As a result, logging often becomes the performance bottleneck of the entire system.

— Recovery System Optimization—

For the optimization of the log or recovery subsystem, there are mainly two types of technologies, one is Group Commit, and the other is Early Lock Release.

  • Group Commit

Group Commit flushes the logs of a group of concurrently executing transactions to disk together, rather than flushing once per transaction. The log has a separate log buffer: all transactions first write into the log buffer, and a dedicated thread periodically flushes its contents to disk, or flushes when the buffer becomes full.
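A minimal sketch of the idea (illustrative Python, not a production design): transactions append records to a shared buffer and block until the flusher thread has made their batch durable with a single write plus fsync.

```python
import os
import tempfile
import threading
import time

class GroupCommitLog:
    """Group-commit sketch: transactions append to a shared buffer and
    block until a background thread has fsync'ed the batch containing
    their record, so many commits share one write + one fsync."""

    def __init__(self, path: str, interval: float = 0.005):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.cond = threading.Condition()
        self.buffer = []      # records not yet on disk
        self.appended = 0     # records handed to the log so far
        self.durable = 0      # records known to be on disk
        self.interval = interval
        self.closed = False
        self.flusher = threading.Thread(target=self._flush_loop, daemon=True)
        self.flusher.start()

    def append(self, record: bytes) -> None:
        """Called per transaction commit; returns once the record is durable."""
        with self.cond:
            self.buffer.append(record)
            self.appended += 1
            my_seq = self.appended
            while self.durable < my_seq:
                self.cond.wait()

    def _flush_loop(self) -> None:
        while not self.closed:
            time.sleep(self.interval)
            with self.cond:
                batch, self.buffer = self.buffer, []
            if batch:
                os.write(self.fd, b"".join(batch))
                os.fsync(self.fd)          # one fsync for the whole group
            with self.cond:
                self.durable += len(batch)
                self.cond.notify_all()

    def close(self) -> None:
        self.closed = True
        self.flusher.join()
        os.close(self.fd)

path = os.path.join(tempfile.mkdtemp(), "group.log")
log = GroupCommitLog(path)
log.append(b"T1 commit\n")   # blocks roughly one interval, then durable
log.close()
```

With many concurrent appenders, each flush cycle batches all records that arrived since the previous cycle, which is where the I/O saving comes from.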

The operating system provides different ways to write to disk, such as sync, fsync, and fdatasync. sync returns once the data reaches the operating system's file buffer, after which a background process of the operating system flushes the buffer contents to disk; flushing with sync can therefore lose data. Database systems usually use fsync for log persistence, which returns only when the record has actually been written to disk. fsync writes both the file data and the file metadata, such as modification time and file length, to disk; fdatasync differs from fsync in that it flushes only data, not metadata.

Some DBMSs mix fsync and fdatasync: when a metadata modification does not affect logging, for example when only the file modification time changes, fdatasync alone suffices; but if an operation changes the file length, fdatasync cannot be used, because it does not persist the metadata change, and part of the content would be missing during recovery. Many DBMSs therefore do not grow the log file incrementally when writing the log, but allocate enough space for it at once; since the file length then stays constant during subsequent log writes, fdatasync can be used to write the log to disk. In short, Group Commit writes a group of transactions to disk with one system call each time, merging many transactions' I/O and thereby reducing the I/O of the entire system.
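The preallocation trick can be sketched as follows (Python; `os.fdatasync` is POSIX/Linux-specific, and the file layout here is our own invention):

```python
import os
import tempfile

LOG_SIZE = 1 << 20            # preallocate the whole log file up front
path = os.path.join(tempfile.mkdtemp(), "wal.log")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
os.truncate(fd, LOG_SIZE)     # fix the file length once...
os.fsync(fd)                  # ...and persist the metadata a single time

def append_record(fd: int, offset: int, record: bytes) -> int:
    """Write a record in place; since the file length never changes,
    fdatasync (data only, no metadata) is enough for durability."""
    os.pwrite(fd, record, offset)
    os.fdatasync(fd)          # not available on every platform
    return offset + len(record)

end = append_record(fd, 0, b"LSN10 T1 update P5\n")
```

Because every later `append_record` writes within the preallocated region, the file length metadata never changes again, so the cheaper fdatasync is safe.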

  • Early Lock Release

When concurrency control is implemented with locks, a later transaction can only wait as long as an earlier transaction holds its locks. In the figure, the black part indicates ongoing transaction work and the gray part the time spent waiting for the log to reach disk: although no data is being modified during that wait, the lock can only be released after the log is flushed. Early Lock Release is an optimization for this scenario. The strategy is to release the lock as soon as the transaction's processing is complete, and only then flush the log to disk, shortening the time the next transaction waits for the lock and increasing the degree of concurrent execution.
But this method also has drawbacks. Suppose the first transaction has released its locks but a failure occurs while its log is being flushed, so it must be rolled back; because its locks may already have been acquired by the next transaction, that transaction must be rolled back together with it, so the system must maintain dependencies between transactions. In practice, Early Lock Release is widely used in databases, particularly for index structures. Locking a node in an index has a large scope of influence, because an index leaf node often covers many data records; if the leaf node is locked, all related records are effectively locked. Therefore indexes usually use Early Lock Release rather than a two-phase locking protocol, to shorten the time that data records stay locked.

— ARIES algorithm —

In disk-based database systems, the recovery subsystem is mostly implemented based on the ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) algorithm. For managing the data buffer and log buffer, ARIES adopts a Steal + No-Force policy (Steal + No-Force is introduced in detail in "In-memory database analysis and mainstream products comparison (1)"). In ARIES, each log record has a sequence number, the LSN (Log Sequence Number). As shown in the figure below, the log record with LSN 10 is transaction T1's update writing Page 5; LSN 20 is transaction T2's update writing Page 3. Note that the log contains a transaction End record, which indicates that the transaction has committed, the client has been answered, and all of the transaction's operations are complete. If the log contains a Commit but no End, the transaction's work may be complete, but the client may not have received a response.
  • ARIES three-stage recovery

ARIES's recovery algorithm is divided into three stages: Analysis, Redo, and Undo. The details of each stage are described below.

1. Analysis: After a crash and restart, the system first reads the log file from disk and analyzes its contents to determine which transactions were active at the time of the crash and which pages had been modified.

2. Redo: In the Redo phase, the system reproduces the state at the time of the failure according to the log, restoring the dirty pages in memory to their state at the crash. This is equivalent to repeating history: every log record is re-executed, including those of transactions without a Commit.

3. Undo: In the Undo phase, the system undoes unfinished transactions. The figure above is a simple log example in which the system crashes after LSN 60. The log contains an End record for transaction T2, so T2 has committed, while transactions T1 and T3 are unfinished. Any modifications of T1 and T3 to P1, P3, and P5 that have already reached disk must be undone on disk.

  • Data structure of log records

The ARIES recovery process relies on the information stored in the log. The log consists of multiple log records; a log record for an update contains the transaction ID, the Page ID plus offset and length of the modified data item, the values before and after modification, and additional control information.
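A possible in-memory layout of such a record, with field names of our own choosing (the paper's actual encoding differs), might look like:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LogRecord:
    """Illustrative layout of an ARIES-style log record."""
    lsn: int                        # unique, monotonically increasing number
    txn_id: int                     # transaction that generated the record
    prev_lsn: Optional[int]         # previous record of the same transaction
    kind: str                       # 'UPDATE', 'COMMIT', 'ABORT', 'END', 'CLR'
    page_id: Optional[int] = None   # modified page (UPDATE/CLR only)
    offset: Optional[int] = None    # location within the page
    length: Optional[int] = None
    before: Optional[bytes] = None  # undo information (before-image)
    after: Optional[bytes] = None   # redo information (after-image)
    undo_next: Optional[int] = None # CLR only: next LSN still to undo

# Example: T1 changes a value on Page 5 from 90 to 100.
rec = LogRecord(lsn=10, txn_id=1, prev_lsn=None, kind="UPDATE",
                page_id=5, offset=0, length=3, before=b"090", after=b"100")
```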

ARIES log record types include Update, Commit, Abort, End, and CLR (Compensation Log Record). The CLR guards against failures during transaction rollback: each time a log record is undone during rollback, a CLR is written, so the system can tell from the CLRs which operations have already been rolled back. Without CLRs, an operation might be rolled back twice.
During normal logging, ARIES records both redo and undo information; each log record contains the values before and after modification. Since the log is written sequentially, the database configuration usually dedicates a separate disk to the log service rather than mixing it with the disk storing data records, in order to improve log write performance.

The following is a schematic diagram of log flushing in ARIES. The sequence on the right side of the figure represents all logs: the blue part is the logs already on disk, and the orange part the logs still in the log buffer. ARIES records a Flushed LSN indicating how far the buffered log has been flushed to disk. In addition, each data page on disk records a Page LSN, the largest LSN of any operation that modified the page (that is, the LSN of the last operation that modified it). When flushing a page from the data buffer to disk, the system compares the Page LSN with the Flushed LSN to decide whether the page may be written: if Page LSN is less than or equal to Flushed LSN, all log records that modified the page are already on disk, so the page may also go to disk. This is WAL (Write-Ahead Logging): the log always reaches disk before the data.
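The WAL check described above reduces to a one-line comparison; the following sketch uses a function name of our own:

```python
def can_flush_page(page_lsn: int, flushed_lsn: int) -> bool:
    """Write-ahead rule: a dirty page may go to disk only after every
    log record that modified it is durable, i.e. pageLSN <= flushedLSN."""
    return page_lsn <= flushed_lsn

# The last update to the page produced LSN 40, but the log is only
# durable up to LSN 30 -- the page must wait.
assert can_flush_page(40, 30) is False
assert can_flush_page(30, 40) is True
```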

Each log record also carries a Prev LSN, the LSN of the previous log record of the same transaction. Because all transactions share the log buffer, their records are interleaved in the log; the Prev LSN chain links all LSNs of one transaction so that all of its log records can be found.
The recovery subsystem maintains a Transaction Table (Xact Table) and a Dirty Page Table. The Xact Table records the status of every active transaction (active, commit, abort, end) together with its Last LSN, the last log record the transaction generated. The Dirty Page Table records which data pages have been modified since being loaded from disk into the buffer, along with each page's Rec LSN, the LSN of the earliest modification, that is, the first modification after the page was loaded into the buffer.

In addition to the information in the log, the database system uses a Master Record to store the LSN of the latest Checkpoint, so that each recovery only needs to start from the most recent Checkpoint. Because stopping the database during a Checkpoint (allowing no transactions to execute) is unacceptable to users, ARIES uses Fuzzy Checkpoints, which allow transactions to keep executing while the Checkpoint is taken. A Fuzzy Checkpoint produces two log records, Begin_Checkpoint and End_Checkpoint: Begin_Checkpoint records the time the Checkpoint starts, End_Checkpoint records the Xact Table and Dirty Page Table, and the Checkpoint's LSN is written to the Master Record on disk for persistence. These are all the data structures required for recovery; the various LSNs are summarized in the table below.

— Transaction recovery of database system —

  • Simple transaction recovery

For simple transaction recovery (the system has not failed, but a transaction cannot continue during execution), a rollback is required. The system first finds the transaction's latest LSN from the Xact Table and undoes it, then follows the log record's Prev LSN to the previous record and continues undoing, until the whole transaction has been rolled back to its initial state. As in normal transaction processing, the data touched during undo must be locked, and a compensation log record (CLR) is written before each rollback step. The CLR records an Undo Next LSN pointing at the next LSN to be undone. When undo reaches the transaction's first log record, Transaction Abort and Transaction End records are written to mark the end of the rollback.
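The rollback loop can be sketched as follows (illustrative Python with invented field names; locking is omitted):

```python
def rollback(txn_id, last_lsn, log, next_lsn):
    """Walk the PrevLSN chain backwards, undo each update in place,
    and write one CLR per undone record."""
    lsn = last_lsn
    while lsn is not None:
        rec = log[lsn]
        if rec["kind"] == "UPDATE":
            pages[rec["page_id"]] = rec["before"]      # reinstall before-image
            log[next_lsn] = {"kind": "CLR", "txn_id": txn_id,
                             "undo_next": rec["prev_lsn"]}
            next_lsn += 1
        lsn = rec["prev_lsn"]
    log[next_lsn] = {"kind": "END", "txn_id": txn_id}  # rollback complete

pages = {5: "new", 3: "new"}    # both updates already applied in memory
log = {
    10: {"kind": "UPDATE", "txn_id": 1, "prev_lsn": None,
         "page_id": 5, "before": "old", "after": "new"},
    20: {"kind": "UPDATE", "txn_id": 1, "prev_lsn": 10,
         "page_id": 3, "before": "old", "after": "new"},
}
rollback(1, 20, log, next_lsn=30)
```

After the call, both pages hold their old values again, and the log contains two CLRs (the first pointing its Undo Next at LSN 10, the second at nothing) followed by an End record.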

  • Failed transaction recovery

As mentioned above, ARIES failure recovery is divided into three stages; the implementation details of each stage are described below.

1. Analysis stage

In the Analysis phase, the system obtains the last Checkpoint from the Master Record on disk, rebuilds the Xact Table and Dirty Page Table, and processes log records sequentially starting from the Begin_Checkpoint record. On a transaction's End record, the transaction is removed from the Xact Table. On a Commit record, the transaction's status in the Xact Table is updated. On any other record, the system checks whether the transaction is in the Xact Table, adds it if not, and updates its Last LSN to the current record's LSN. In addition, the system checks whether the data page updated by the record is in the Dirty Page Table; if not, the page is added with its Rec LSN set to the current LSN.
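A simplified sketch of the Analysis scan (illustrative field names; the tables logged at Begin_Checkpoint are assumed empty here):

```python
def analysis(records):
    """Rebuild the Xact Table and Dirty Page Table by scanning forward
    from the Begin_Checkpoint record."""
    xact_table = {}    # txn_id -> {"status", "last_lsn"}
    dirty_pages = {}   # page_id -> Rec LSN (earliest LSN that dirtied it)
    for rec in records:
        tid = rec["txn_id"]
        if rec["kind"] == "END":
            xact_table.pop(tid, None)          # finished: drop from the table
            continue
        entry = xact_table.setdefault(tid, {"status": "active", "last_lsn": None})
        entry["last_lsn"] = rec["lsn"]
        if rec["kind"] == "COMMIT":
            entry["status"] = "commit"
        elif rec["kind"] == "ABORT":
            entry["status"] = "abort"
        if rec.get("page_id") is not None:     # UPDATE or CLR dirtied a page
            dirty_pages.setdefault(rec["page_id"], rec["lsn"])
    return xact_table, dirty_pages

# The example from the text: T1 aborts and ends before the crash,
# while T2 and T3 are still active.
records = [
    {"lsn": 10, "txn_id": 1, "kind": "UPDATE", "page_id": 5},
    {"lsn": 20, "txn_id": 2, "kind": "UPDATE", "page_id": 3},
    {"lsn": 30, "txn_id": 1, "kind": "ABORT"},
    {"lsn": 40, "txn_id": 1, "kind": "CLR", "page_id": 5},
    {"lsn": 45, "txn_id": 1, "kind": "END"},
    {"lsn": 50, "txn_id": 3, "kind": "UPDATE", "page_id": 1},
    {"lsn": 60, "txn_id": 2, "kind": "UPDATE", "page_id": 5},
]
xact_table, dirty_pages = analysis(records)
```

The scan leaves only T2 and T3 in the Xact Table, with P1, P3, and P5 in the Dirty Page Table, matching the worked example later in the article.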

2. Redo stage

In the Redo phase, the system first finds the smallest Rec LSN among all pages in the Dirty Page Table and uses it as the starting position of redo, since the modifications of all earlier log records have already reached disk and their pages do not appear in the Dirty Page Table. Starting there, the system redoes (replays) each subsequent update log record, including CLRs, in sequence. If the updated page is not in the Dirty Page Table, or the page is in the Dirty Page Table but its Rec LSN is greater than the current LSN, or the Page LSN on disk is greater than or equal to the current LSN, the record's effect is already on disk and it is skipped without redo. During redo the system writes no log records, because redo only rebuilds the in-memory state; if the system fails again during redo, the same operations are simply repeated.
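The three skip conditions can be expressed as a single predicate (illustrative sketch; field and parameter names are our own):

```python
def should_redo(rec, dirty_pages, disk_page_lsn):
    """Redo-phase filter: replay an update unless the page is already
    known to contain its effect."""
    pid = rec["page_id"]
    if pid not in dirty_pages:
        return False                     # page was flushed before the crash
    if dirty_pages[pid] > rec["lsn"]:
        return False                     # page dirtied only after this record
    if disk_page_lsn.get(pid, -1) >= rec["lsn"]:
        return False                     # on-disk page already reflects it
    return True

# Disk page 5 carries Page LSN 40: the update at LSN 10 is skipped,
# the update at LSN 50 must be replayed.
assert should_redo({"page_id": 5, "lsn": 10}, {5: 10}, {5: 40}) is False
assert should_redo({"page_id": 5, "lsn": 50}, {5: 10}, {5: 40}) is True
```

In a real system, the third check is only made after the page has been fetched, since it requires reading the Page LSN from the page itself.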

3. Undo stage

The Undo phase undoes the transactions that were unfinished when the system failed. First a to-undo set of LSNs is created, containing the last log number of each transaction to be rolled back, and then the system loops. It selects the largest (that is, the latest) LSN in the set and undoes it. If the record is a CLR and its Undo Next is empty, the transaction is fully undone and an End record is written; if the CLR's Undo Next is not empty, there is another record still to undo, and that LSN is added to the set. If the record is an update, it is rolled back, a CLR is written, and the record's Prev LSN is added to the set. The loop continues until the set is empty.
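The Undo loop described above might be sketched like this (illustrative Python handling only UPDATE and CLR records; locking and the pages' on-disk state are omitted):

```python
def undo(loser_last_lsns, log, next_lsn):
    """Repeatedly take the largest LSN from the to-undo set, compensate
    it, and follow PrevLSN / UndoNext until the set is empty."""
    to_undo = set(loser_last_lsns)
    while to_undo:
        lsn = max(to_undo)                     # always undo the latest first
        to_undo.remove(lsn)
        rec = log[lsn]
        if rec["kind"] == "CLR":
            if rec["undo_next"] is None:       # transaction fully undone
                log[next_lsn] = {"kind": "END", "txn_id": rec["txn_id"]}
                next_lsn += 1
            else:
                to_undo.add(rec["undo_next"])  # keep undoing from there
        elif rec["kind"] == "UPDATE":
            pages[rec["page_id"]] = rec["before"]
            log[next_lsn] = {"kind": "CLR", "txn_id": rec["txn_id"],
                             "undo_next": rec["prev_lsn"]}
            next_lsn += 1
            if rec["prev_lsn"] is not None:
                to_undo.add(rec["prev_lsn"])
            else:                              # first record: write END
                log[next_lsn] = {"kind": "END", "txn_id": rec["txn_id"]}
                next_lsn += 1

# Two loser transactions: T1 made two updates to P5, T2 one update to P3.
pages = {5: "v2", 3: "v1"}
log = {
    10: {"kind": "UPDATE", "txn_id": 1, "prev_lsn": None, "page_id": 5, "before": "v0"},
    20: {"kind": "UPDATE", "txn_id": 1, "prev_lsn": 10, "page_id": 5, "before": "v1"},
    30: {"kind": "UPDATE", "txn_id": 2, "prev_lsn": None, "page_id": 3, "before": "v0"},
}
undo({20, 30}, log, next_lsn=40)
```

Because every undone record immediately gains a CLR, re-running this loop after a crash mid-undo resumes from the CLRs rather than rolling anything back twice.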

Finally, let us walk through the whole process with an example. The system first takes a Fuzzy Checkpoint, after which there are two updates: T1 modifies P5, and T2 modifies P3. T1 then aborts; LSN 40 records the compensation log rolling back LSN 10, followed by T1 End. Other transactions continue: T3 modifies P1, and T2 modifies P5. At this point the system crashes. How is it recovered? First, Analysis scans forward from the Checkpoint and finds that T1 has an End record and needs no undo, while T2 and T3 have no End. The Xact Table therefore contains only T2 and T3, and the Dirty Page Table contains P1, P3, and P5. After Analysis, Redo restores the failure state, and then Undo rolls back T2 and T3, writing a CLR for each undone record, until each transaction's first record has been undone.
If another crash occurs during recovery (as shown in the figure below), the two undo operations have already recorded CLRs, so the new redo replays those two CLRs and the new undo does not roll them back again; recovery continues from where it left off until all transactions are undone.

  • ARIES summary

ARIES is a maturely designed recovery system that guarantees the atomicity and durability of transactions. It uses WAL with a Steal + No-Force buffer management policy without compromising the correctness of the system. The LSN in ARIES is a monotonically increasing unique identifier of log records, and Prev LSN chains link all log records of a transaction. The Page LSN records the LSN of the last modification to each page, and the system reduces the cost of recovery through Checkpoints. Recovery itself has three steps: Analysis determines which transactions were unfinished, which pages had been modified, and whether the modifications reached disk; Redo then restores the state at the time of the failure; and Undo rolls back the transactions that must be undone.

— Summary of this article —

This article introduced the basic concepts of Logging and Recovery and discussed the technical principles of ARIES, the recovery subsystem of traditional disk-based database management systems. The next article will continue exploring the database recovery subsystem, discussing Early Lock Release and Logical Undo in DBMS recovery, and will introduce two database recovery techniques together with in-memory database recovery methods.

References:

  1. C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz. 1992. ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking And Partial Rollbacks Using Write-Ahead Logging. ACM Trans. Database Syst. 17, 1 (March 1992), 94–162.

  2. Antonopoulos, P., Byrne, P., Chen, W., Diaconu, C., Kodandaramaih, R. T., Kodavalla, H., ... & Venkataramanappa, G. M. (2019). Constant time recovery in Azure SQL database. Proceedings of the VLDB Endowment, 12(12), 2143-2154.

  3. Zheng, W., Tu, S., Kohler, E., & Liskov, B. (2014). Fast databases with fast durability and recovery through multicore parallelism. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (pp. 465-477).

  4. Ren, K., Diamond, T., Abadi, D. J., & Thomson, A. (2016, June). Low-overhead asynchronous checkpointing in main-memory database systems. In Proceedings of the 2016 International Conference on Management of Data (pp. 1539-1551).

  5. Kemper, A., & Neumann, T. (2011, April). HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In 2011 IEEE 27th International Conference on Data Engineering (pp. 195-206). IEEE.

Origin: blog.51cto.com/15015752/2554377