Comparison of common technologies and schemes of database recovery subsystem (2)

Author: Chen / Big Data Laboratory
last article, "Common technical and program comparison database recovery subsystem (a)", we introduce the basic database management system Logging & Recovery recovery subsystem, detailed The concept and technical implementation of ARIES, a mainstream recovery algorithm based on Physical Logging, are discussed. This article will share with you the introduction of the principle of Logical Undo Logging and the recovery technology of the two database systems SQL Server (Azure) and Silo from Professor Gong Xueqing of China Normal University.

— Logical Undo Logging —

In the last article, we briefly introduced the optimization idea of Early Lock Release: releasing the lock on the index early to improve concurrency. The cost is that dependencies can form between transactions, leading to cascading rollbacks. For example, suppose the first transaction has released its lock but fails while its log is being flushed and must be rolled back; by then the lock has already been acquired by the next transaction, which must now be rolled back together with the first one, greatly hurting system performance.

Logical Undo Logging was introduced into the recovery system to solve this problem to a certain extent. Its basic idea is that when a transaction rolls back, the system undoes the modification operation rather than restoring the data to its old bytes: the undo of an insertion is a deletion, and the undo of adding 1 to a data item is subtracting 1 from it.

In addition, handling Logical Undo introduces the concept of Operation Logging: certain operations of a transaction are logged in a special way, called Transaction Operation Logging. When an operation starts, a special log record of the form <Ti, Oj, operation-begin> is written, where Oj is a unique operation ID. While the operation runs, the system records physical redo and undo logs as usual. When the operation ends, a special record <Ti, Oj, operation-end, U> is written, where U is the logical undo operation for the work the operation performed.

For example, suppose an index insert adds a new key-value pair (K5, RID7) to leaf node I9, where K5 is stored at position X of the I9 page, replacing the original value Old1, and RID7 is stored at position X+8, replacing the original value Old2. When logging this operation, the record <T1, O1, operation-begin> is written first, the physical logging in the middle records the value updates as usual, and the end record is <T1, O1, operation-end, (delete I9, K5, RID7)>, where (delete I9, K5, RID7) is the logical undo of T1's operation: delete K5 and RID7 from the I9 node page.
After the operation finishes, the lock can be released early, allowing other transactions to insert their own <Key, Record ID> pairs into the page. That, however, may cause all key values to be reordered, moving K5 and RID7 away from positions X and X+8. Performing Physical Undo at this point would mean removing K5 and RID7 from X and X+8, but since their positions have changed, undoing according to the original log records is not feasible. Logically, though, the undo only needs to delete K5 and RID7 from index node I9, which is exactly why the operation log (delete I9, K5, RID7) was written: to undo the operation, just follow this logical instruction.

During rollback, the system scans the log backwards. If no operation-end record is found for an operation, Physical Undo is performed; the lock is only released when an operation ends, so the absence of operation-end means the lock was still held and no other transaction could have modified the locked data items. If an operation-end record is found, the lock has been released, and only Logical Undo can be performed; all log records between operation-begin and operation-end are skipped, because their undo has been replaced by the logical undo. If a <T, O, operation-abort> record is encountered during undo, that operation has already been aborted successfully, so the scan skips the intermediate records directly back to the operation's <T, O, operation-begin> and continues undoing from there. When undo reaches the <T, start> record, all log records of the transaction have been undone, and writing a <T, abort> record marks the end of the undo.
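
To make the scan concrete, here is a minimal C++ sketch of rollback with Logical Undo. The record layout and names (RecType, LogRec, rollback) are illustrative assumptions, not an actual DBMS interface; undo actions are printed instead of applied.

```cpp
#include <iostream>
#include <string>
#include <vector>

enum class RecType { Update, OpBegin, OpEnd, OpAbort, TxnStart };

struct LogRec {
    RecType     type;
    int         txnId;
    int         opId;         // used by OpBegin / OpEnd / OpAbort
    std::string physicalUndo; // used by Update records
    std::string logicalUndo;  // used by OpEnd, e.g. "(delete I9, K5, RID7)"
};

// Skip backwards to the operation-begin record of operation opId.
int skipToOpBegin(const std::vector<LogRec>& log, int i, int txnId, int opId) {
    while (i >= 0 && !(log[i].txnId == txnId && log[i].opId == opId &&
                       log[i].type == RecType::OpBegin))
        --i;
    return i;
}

// Undo transaction txnId by scanning its log records from the tail.
void rollback(const std::vector<LogRec>& log, int txnId) {
    for (int i = static_cast<int>(log.size()) - 1; i >= 0; --i) {
        const LogRec& r = log[i];
        if (r.txnId != txnId) continue;
        if (r.type == RecType::OpEnd) {
            // operation-end seen: the lock was released, so only the
            // logical undo may run; skip everything back to operation-begin.
            std::cout << "logical undo: " << r.logicalUndo << "\n";
            i = skipToOpBegin(log, i, txnId, r.opId);
        } else if (r.type == RecType::OpAbort) {
            // operation already undone by a partial rollback: skip it.
            i = skipToOpBegin(log, i, txnId, r.opId);
        } else if (r.type == RecType::Update) {
            // no operation-end yet: the lock is still held, physical undo.
            std::cout << "physical undo: " << r.physicalUndo << "\n";
        } else if (r.type == RecType::TxnStart) {
            std::cout << "write <T" << txnId << ", abort>\n"; // undo complete
            return;
        } // OpBegin reached by skipping needs no action of its own
    }
}

int main() {
    std::vector<LogRec> log = {
        {RecType::TxnStart, 1, 0, "", ""},
        {RecType::OpBegin,  1, 1, "", ""},
        {RecType::Update,   1, 1, "I9[X] = Old1",   ""},
        {RecType::Update,   1, 1, "I9[X+8] = Old2", ""},
        {RecType::OpEnd,    1, 1, "", "(delete I9, K5, RID7)"},
    };
    rollback(log, 1); // prints the logical undo, then <T1, abort>
}
```

The key invariant is that once an operation-end record exists, the physical records inside the operation are never undone individually; only the logical undo runs.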

Note that all redo operations are Physical Redo. Logical Redo would be very complicated to implement (for example, the redo order would have to be determined), so in most systems all redo information is physical. Physical Redo also does not conflict with early lock release, because redo only happens after a failure: at that point the system has crashed, all locks are gone, and locks must be re-acquired during redo anyway.

— SQL Server: Constant Time Recovery —

Most commercial database systems, such as SQL Server, use an ARIES-style recovery system, so the work of undoing all uncommitted transactions is proportional to the work each of those transactions performed: the more operations a transaction did, the longer its undo takes. Since a single statement may update many records, and undo must reverse them one by one, undo can take a very long time. Correctness is guaranteed, but this is hard to accept for cloud services or systems with high availability requirements. CTR (Constant Time Recovery) optimization emerged to deal with this situation: it combines the ARIES framework with multi-version concurrency control to achieve fixed-time recovery. Fixed-time recovery means using the multi-version information inside the database to guarantee that recovery completes within a bounded time regardless of what is encountered. The basic idea is to use the different data versions of a multi-version system to bring the data back to a correct state during undo, instead of restoring it from the information in the original WAL log.

  • Concurrency Control of MS-SQL

SQL Server introduced multi-version concurrency control in 2005, but the early multi-versioning was used only to implement the Snapshot isolation level, not recovery: it allows the system to read data as of a snapshot timestamp. Under multi-version concurrency control, an update modifies the record in place on the data page, but the old version is not lost; it is placed separately in the Version Store, a special table that only allows data to be appended (Append-Only). The record points to its old version through a pointer, and each old version points to the next older one, forming a version chain. When accessing data, the transaction's timestamp decides which version to read. The early Version Store did not need to log its updates: after a failure and restart, all timestamps are new, so as long as the latest version is preserved, no snapshot older than it will ever be requested. For currently active transactions, old versions can be discarded by the Garbage Collection mechanism according to the current timestamp.
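
A snapshot read over such a version chain can be sketched in a few lines. The structures below (Version, readAsOf) are illustrative assumptions, not SQL Server's actual on-page layout:

```cpp
#include <iostream>
#include <string>

struct Version {
    long        commitTs;  // timestamp of the transaction that wrote it
    std::string value;
    Version*    older;     // pointer into the (append-only) Version Store
};

// Follow the chain from the in-place record to the newest version
// whose commit timestamp is visible to the reader's snapshot.
const Version* readAsOf(const Version* latest, long snapshotTs) {
    for (const Version* v = latest; v != nullptr; v = v->older)
        if (v->commitTs <= snapshotTs)
            return v;
    return nullptr; // the record did not exist at snapshotTs
}

int main() {
    Version v1{100, "old", nullptr};
    Version v2{200, "new", &v1};    // in-place record, old version chained
    std::cout << readAsOf(&v2, 150)->value << "\n"; // prints "old"
    std::cout << readAsOf(&v2, 250)->value << "\n"; // prints "new"
}
```
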
CTR optimizes the original Version Store into a Persistent Version Store, which stores old versions durably. Under CTR, an update to the Version Store writes a log record in preparation for recovery, which increases the Version Store's volume and overhead, so two strategies are used to store old versions: In-Row Versioning and Off-Row Versioning.

In-Row Versioning applies when an update changes only a small amount of data: instead of storing the old version in the Version Store, a delta value is appended after the record to represent the attribute's value change. The purpose is to reduce the overhead of versioning, since a change at the same location costs relatively little disk I/O.

Off-Row Versioning uses a special system table to store the old versions of all tables, and records the redo of its insert operations through the WAL. It is used when the modification is too large for In-Row Versioning to hold the update. For example, if Col 2 of row A4 is 444 and is updated to 555, a delta is written to record the version change. This in-row approach is limited both by the amount of data and by whether the page holding the record has free space: if there is free space, the delta can be written in-row; if not, the old version must be placed in the Off-Row Versioning table.
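
The choice between the two strategies can be sketched as follows; the threshold kInRowLimit and the structures are assumptions for illustration, not SQL Server's actual rules:

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Page         { size_t freeBytes; };
struct DeltaVersion { int column; std::string oldValue; };

void storeOldVersion(Page& page, const DeltaVersion& delta,
                     std::vector<DeltaVersion>& offRowTable) {
    // Small change and room on the record's own page: keep the delta
    // in-row, which costs little extra I/O since the page is written anyway.
    const size_t kInRowLimit = 128; // assumed size cap for an in-row delta
    if (delta.oldValue.size() <= kInRowLimit &&
        page.freeBytes >= delta.oldValue.size()) {
        page.freeBytes -= delta.oldValue.size();
        std::cout << "in-row delta: col " << delta.column
                  << " was " << delta.oldValue << "\n";
    } else {
        // Otherwise the old version goes to the off-row system table,
        // whose inserts are redo-logged through the WAL.
        offRowTable.push_back(delta);
        std::cout << "off-row version stored\n";
    }
}

int main() {
    Page page{64};
    std::vector<DeltaVersion> offRow;
    storeOldVersion(page, {2, "444"}, offRow);                 // fits in-row
    storeOldVersion(page, {2, std::string(500, 'x')}, offRow); // off-row
}
```
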
CTR recovery runs in three phases (Analysis, Redo, and Undo). The Analysis phase is similar to ARIES and determines the state of each transaction, such as active, committed, or requiring undo. During Redo, the system replays both the main table and the Version Store table, restoring both to their state at the moment of the crash; once Redo completes, the database can come online and serve requests. The third phase is Undo. Since the Analysis phase already identified which transactions were uncommitted, the Undo phase simply marks those transactions as aborted. Because every version of a record carries the ID of the transaction that wrote it, a later transaction reading a version first checks the state of that transaction: if it is aborted, the version is ignored and the previous version is read instead. Reading an unavailable version therefore requires following the chain to an earlier version, which adds some overhead, but it drastically reduces database downtime. Once the database is back online, the system uses spare time to garbage-collect the invalid old versions. This mechanism is called Logical Revert.
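
A minimal sketch of this undo-by-marking idea, with assumed structures (TxnState, undoPhase, read), looks like this; note how the undo phase does constant work per transaction no matter how many records the transaction wrote:

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

enum class TxnState { Active, Committed, Aborted };

struct Version {
    int         txnId;  // transaction that wrote this version
    std::string value;
    Version*    older;
};

// Undo phase: constant time per transaction, regardless of write volume.
void undoPhase(std::unordered_map<int, TxnState>& txnTable) {
    for (auto& [tid, state] : txnTable)
        if (state == TxnState::Active) state = TxnState::Aborted;
}

// Readers skip any version created by an aborted transaction.
const Version* read(const Version* latest,
                    const std::unordered_map<int, TxnState>& txnTable) {
    for (const Version* v = latest; v != nullptr; v = v->older)
        if (txnTable.at(v->txnId) == TxnState::Committed)
            return v;
    return nullptr;
}

int main() {
    std::unordered_map<int, TxnState> txns{{1, TxnState::Committed},
                                           {2, TxnState::Active}};
    Version v1{1, "committed value", nullptr};
    Version v2{2, "uncommitted value", &v1};
    undoPhase(txns);                             // T2 marked aborted
    std::cout << read(&v2, txns)->value << "\n"; // prints "committed value"
}
```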

  • Logical Revert

Logical Revert happens in two ways. The first uses a background process, Background Cleanup, that scans all data blocks to decide what garbage can be reclaimed. The rule is: if the latest version in the main table comes from an aborted transaction, take the newest committed version from the Version Store and move it back into the main table. Even if this is not done right away, later readers will simply fetch the committed version from the Version Store, so the migration can proceed slowly through the background Garbage Collection process. The second way is that when a transaction updating a record finds that the version in the main table belongs to an aborted transaction, it simply overwrites that version; the correct prior version is at that point safely in the Version Store.
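
Both paths can be sketched as follows, again with illustrative types rather than SQL Server internals:

```cpp
#include <iostream>
#include <string>
#include <unordered_set>

struct Version { int txnId; std::string value; Version* older; };
struct Record  { Version* inPlace; };

// Way 1: background cleanup finds main-table versions written by aborted
// transactions and swaps in the newest committed version from the chain.
void backgroundCleanup(Record& r, const std::unordered_set<int>& aborted) {
    if (aborted.count(r.inPlace->txnId)) {
        Version* v = r.inPlace->older;
        while (v && aborted.count(v->txnId)) v = v->older;
        if (v) r.inPlace = v; // committed version returns to the main table
    }
}

// Way 2: an updating transaction that sees an aborted in-place version
// simply overwrites it with its own new version.
void updateInPlace(Record& r, const std::unordered_set<int>& aborted,
                   Version* newVersion) {
    newVersion->older = aborted.count(r.inPlace->txnId)
                            ? r.inPlace->older  // drop the aborted version
                            : r.inPlace;
    r.inPlace = newVersion;
}

int main() {
    std::unordered_set<int> aborted{2};
    Version v1{1, "committed", nullptr};
    Version v2{2, "aborted", &v1};
    Record rec{&v2};
    backgroundCleanup(rec, aborted);
    std::cout << rec.inPlace->value << "\n"; // prints "committed"
}
```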

CTR recovery is therefore fixed-time: the database is available once the first two phases finish, and the time they need depends only on the transaction checkpoint. If the checkpoint interval is determined by a fixed log size, the database can resume work as soon as the Redo phase ends, and the recovery time never exceeds a fixed bound.

— Silo: Force Recovery —

Silo is a high-performance in-memory database prototype developed jointly by Harvard and MIT to solve the throughput drop caused by high concurrency. If each CPU core runs one thread executing transactions, then in the absence of contention, throughput rises with the number of cores, yet beyond a certain point it falls: a bottleneck caused by contention on some shared resource. Although each thread executes its transaction independently, every transaction must obtain a transaction ID before committing, and that ID is global: a committing transaction obtains it through the atomic operation atomic_fetch_and_add(&global_tid). Transaction ID assignment goes through a global manager role; when a transaction requests an ID, the manager increments the transaction counter by 1, guaranteeing globally unique, monotonically increasing IDs. The speed of this single write therefore becomes the upper bound of system performance: as concurrency grows, transactions contend when requesting IDs, waits lengthen, and throughput drops.
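
The bottleneck is easy to reproduce with a plain shared counter. The following is an illustrative C++ micro-benchmark, not Silo code; on most multi-core machines the per-TID cost rises with the thread count because every fetch_add bounces the same cache line between cores:

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<unsigned long> global_tid{0};

void worker(long commits) {
    for (long i = 0; i < commits; ++i)
        global_tid.fetch_add(1, std::memory_order_relaxed); // one TID/commit
}

int main() {
    const long kCommits = 1'000'000;
    for (int threads : {1, 4, 16}) {
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (int t = 0; t < threads; ++t)
            pool.emplace_back(worker, kCommits);
        for (auto& th : pool) th.join();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::cout << threads << " threads: " << ms << " ms for "
                  << threads * kCommits << " TIDs\n"; // throughput plateaus
    }
}
```

The relaxed memory order is deliberately generous here; the slowdown comes from cache-coherence traffic on the counter itself, which no memory-order choice can remove.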

  • Optimistic concurrency control

Silo's answer to this bottleneck is to use optimistic concurrency control in a multi-core, shared-memory database. Optimistic concurrency control was introduced in "In-memory database analysis and mainstream product comparison (3)": transactions are assumed not to affect each other during execution, and conflicts are checked only at commit; if there is no conflict, the transaction obtains a global transaction ID and commits. By designing Force Recovery, Silo eliminates the required global transaction ID and adopts the idea of Group Commit, dividing time into epochs of 40 milliseconds each. An epoch contains all the transactions of the current time window, so the whole group can be committed together by committing the epoch, with no need to request a global transaction ID for each transaction one by one. The flaw of this design is that a transaction running longer than 40 milliseconds spans epochs, which affects commit and recovery.

In Silo, each transaction is distinguished by a Sequence Number plus an Epoch Number. The Sequence Number determines the ordering of transactions during execution, and the recovery strategy is determined jointly by the Sequence Number and the Epoch Number. Each transaction has a transaction ID (TID); transactions group-commit by epoch, and commits are serialized by Epoch Number. The TID is 64 bits and consists of status bits, the Sequence Number and the Epoch Number: the high bits hold the Epoch Number, the middle bits the Sequence Number, and the low three bits are status bits. Every record stores the TID of the transaction that last wrote it; the status bits hold information such as the latch taken while the record is being accessed. An important difference from a traditional disk-based DBMS is that lock management lives together with the record itself: there is no separately managed Data Buffer and Lock Table.
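
Packing and unpacking such a TID word takes only a few bit operations. The field widths below (29 sequence bits) are an assumption for illustration; the design fixes only the three status bits, not the exact split:

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>

constexpr int      kStatusBits = 3;
constexpr int      kSeqBits    = 29;              // assumed field width
constexpr uint64_t kLockBit    = 1ull << 0;       // one of the status bits
constexpr uint64_t kSeqMask    = ((1ull << kSeqBits) - 1) << kStatusBits;

uint64_t makeTid(uint64_t epoch, uint64_t seq) {
    return (epoch << (kStatusBits + kSeqBits)) | (seq << kStatusBits);
}
uint64_t epochOf(uint64_t tid) { return tid >> (kStatusBits + kSeqBits); }
uint64_t seqOf(uint64_t tid)   { return (tid & kSeqMask) >> kStatusBits; }

int main() {
    uint64_t tid = makeTid(/*epoch=*/7, /*seq=*/42);
    tid |= kLockBit;                    // record latched during commit
    assert(epochOf(tid) == 7 && seqOf(tid) == 42);
    std::cout << "epoch " << epochOf(tid) << ", seq " << seqOf(tid)
              << ", locked " << bool(tid & kLockBit) << "\n";
    // The epoch occupies the high bits, so with the status bits masked
    // off, a TID from a later epoch always orders after an earlier one.
}
```
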

  • Three phases of transaction commit

Since Silo uses standard optimistic concurrency control, conflicts are checked only at commit time. In the pre-commit phase, when a transaction reads a data item, it stores the item's transaction ID in its local Read-Set and then reads the value; when it modifies a record, it places the modified record in its local Write-Set.

Silo commits a transaction in three phases. First, it acquires locks on all records to be written in the local Write-Set; the lock information lives in the status bits of the TID and is taken through the atomic operation Compare-and-Set. It then reads the current Epoch Number; a dedicated thread is responsible for advancing the epoch (adding 1 every 40 ms), so transactions never contend to write the Epoch Number and only need to read it. The second phase validates the Read-Set. Each data item in Silo carries the TID of the last transaction that updated it; if a record's TID has changed, or the record is locked by another transaction, the record was changed between the read and the commit, and the transaction must roll back. Finally, the transaction's TID is generated: the Sequence Number of the new TID must be a minimum value greater than the TIDs of all records read or written, so that TIDs keep increasing. After commit, the result is returned to the client only after all of the epoch's transactions have been persisted.
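
The three phases can be sketched as follows. This is a simplified illustration: kLockBit, tryLock and the map-based read/write sets are assumptions, while real Silo packs the lock into the record's TID word and folds the current epoch into the new TID with more care:

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

struct Record { std::atomic<uint64_t> tid{0}; int value{0}; };
constexpr uint64_t kLockBit = 1ull << 63; // illustrative lock flag

bool tryLock(Record& r) {
    uint64_t t = r.tid.load();
    return !(t & kLockBit) && r.tid.compare_exchange_strong(t, t | kLockBit);
}

bool commit(std::map<Record*, int>& writeSet,
            std::map<Record*, uint64_t>& readSet, uint64_t epoch) {
    std::vector<Record*> locked;
    auto abort = [&] {
        for (Record* r : locked) r->tid.fetch_and(~kLockBit);
        return false;
    };
    // Phase 1: lock the write set (std::map orders by address: no deadlock).
    for (auto& [rec, _] : writeSet) {
        if (!tryLock(*rec)) return abort();
        locked.push_back(rec);
    }
    // Phase 2: validate the read set; a changed TID, or a lock held by
    // another transaction, means the record moved under us: roll back.
    for (auto& [rec, seenTid] : readSet) {
        uint64_t cur = rec->tid.load();
        if ((cur & ~kLockBit) != seenTid ||
            ((cur & kLockBit) && !writeSet.count(rec)))
            return abort();
    }
    // Phase 3: pick a TID above everything observed, install the writes,
    // and release the locks by storing the new (unlocked) TID.
    uint64_t newTid = epoch << 32;               // assumed epoch field
    for (auto& [_, seenTid] : readSet) newTid = std::max(newTid, seenTid + 1);
    for (auto& [rec, val] : writeSet) { rec->value = val; rec->tid.store(newTid); }
    return true;
}

int main() {
    Record r;
    std::map<Record*, int> ws{{&r, 42}};
    std::map<Record*, uint64_t> rs{{&r, r.tid.load()}};
    std::cout << (commit(ws, rs, 1) ? "committed" : "aborted") << "\n";
}
```

Locking the write set in a fixed (address) order sidesteps deadlock, which is why the sketch keeps the write set in a std::map.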

  • Recovery: SiloR

SiloR is Silo's recovery subsystem. It uses Physical Logging and Checkpoints to guarantee the durability of transactions, and applies parallel recovery strategies to both. As mentioned earlier, log writing is the slowest operation in an in-memory database, because logging writes to disk and disk I/O is the performance bottleneck of the whole architecture. SiloR therefore writes logs concurrently, under the following assumption: every disk in the system has a log thread serving a group of worker threads, and the log thread and its group of worker threads share a CPU socket.

Based on this assumption, the log thread maintains a log buffer pool containing multiple log buffers. Before a worker thread can execute, it asks the log thread for a log buffer to write its logs into; when the buffer fills up, the worker returns it to the log thread for flushing to disk and takes a new one, and if no buffer is available, the worker blocks. The log thread flushes buffers to disk periodically, and an emptied buffer can be handed back to worker threads. A new log file is created every 100 epochs, and old log files are named according to a fixed rule in which the last part of the file name identifies the largest epoch number in the file. The log content records each transaction's TID and its set of record updates (Table, Key, New Value).
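
The buffer handoff can be sketched as below, single-threaded for brevity; LogThread, acquire/release and the file naming are simplified assumptions (the real rule names a file by the largest epoch it contains):

```cpp
#include <cstdint>
#include <deque>
#include <fstream>
#include <string>
#include <vector>

struct LogEntry  { uint64_t tid; std::string table, key, newValue; };
struct LogBuffer { std::vector<LogEntry> entries; };

class LogThread {
    std::deque<LogBuffer> freePool; // buffers available to worker threads
public:
    explicit LogThread(int nBuffers) : freePool(nBuffers) {}

    // A worker asks for a buffer before it can run; an empty pool means
    // the worker must block until a flushed buffer comes back.
    bool acquire(LogBuffer& out) {
        if (freePool.empty()) return false;
        out = std::move(freePool.front());
        freePool.pop_front();
        return true;
    }

    // A full buffer is handed back, flushed to the current log file, and
    // recycled; here the file name simply encodes the 100-epoch group.
    void release(LogBuffer buf, uint64_t epoch) {
        std::ofstream f("log." + std::to_string(epoch / 100), std::ios::app);
        for (const LogEntry& e : buf.entries)
            f << e.tid << ' ' << e.table << ' ' << e.key << ' '
              << e.newValue << '\n';
        buf.entries.clear();
        freePool.push_back(std::move(buf));
    }
};

int main() {
    LogThread logger(2);
    LogBuffer buf;
    if (logger.acquire(buf)) {
        buf.entries.push_back({42, "T", "K5", "555"});
        logger.release(std::move(buf), /*epoch=*/7); // flushes, then recycles
    }
}
```
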
The above describes what one core of a multi-core CPU does; in fact every core works the same way. A dedicated thread tracks which logs have been completely flushed to disk and writes the latest fully persistent epoch to a fixed location on disk. Every transaction compares its own epoch with the persistent epoch on disk (hereafter Pepoch); if its epoch is less than or equal to Pepoch, its log has been persisted, and the result can be returned to the client.

  • Recovery process

Like the ARIES recovery system, SiloR needs Checkpoints. SiloR's first step is to read the data of the last checkpoint back in and recover from it; since the in-memory database does not log indexes, the indexes must be rebuilt in memory. The second step is log replay. The in-memory database only performs Redo, not Undo, and the redo is not applied in normal log order as in ARIES but from back to front, updating records directly to their latest versions. During replay, SiloR first checks the Pepoch log file to find the latest Pepoch number; any log record beyond the Pepoch number belongs to a transaction whose result was never returned to the client, so it can be ignored. Log replay uses Value Logging: for each log record, check whether the data record already exists; if not, create it from the log record; if it does, compare the record's TID with the log record's TID, and if the log's TID is larger, redo the change, replacing the old value with the new value in the log. SiloR's redo thus does not restore the failure site step by step the way ARIES does: ARIES must restore the data on disk to its final exact state, whereas SiloR only needs to restore the data in memory to the latest correct state.
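
A minimal sketch of this replay rule, with assumed structures (LogRec, Row, replay):

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct LogRec { uint64_t tid, epoch; std::string key, value; };
struct Row    { uint64_t tid; std::string value; };

void replay(const std::vector<LogRec>& log, uint64_t pepoch,
            std::unordered_map<std::string, Row>& table) {
    for (const LogRec& r : log) {
        if (r.epoch > pepoch) continue; // never acknowledged to the client
        auto it = table.find(r.key);
        if (it == table.end())
            table[r.key] = {r.tid, r.value};  // record did not exist yet
        else if (r.tid > it->second.tid)
            it->second = {r.tid, r.value};    // newer version wins
        // Older TIDs are simply ignored.
    }
}

int main() {
    std::unordered_map<std::string, Row> table;
    std::vector<LogRec> log = {
        {/*tid=*/5,  /*epoch=*/1, "K5", "v1"},
        {/*tid=*/9,  /*epoch=*/2, "K5", "v2"},
        {/*tid=*/12, /*epoch=*/9, "K5", "lost"}, // past Pepoch: ignored
    };
    replay(log, /*pepoch=*/2, table);
    std::cout << table["K5"].value << "\n";      // prints "v2"
}
```

Because a larger TID always wins, replay converges on the latest state regardless of processing order, which is what allows SiloR to process log files in parallel and from back to front.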

— In-Memory Checkpoint —

For an in-memory database, the recovery system is simpler than in a disk-based DBMS: only the data needs to be logged, not the indexes, and only redo logs are needed, not undo logs; all data is modified in place, so no dirty page management is required, and there is no buffer pool or flushing problem. But the in-memory database is still limited by the time overhead of log synchronization, because logs must still be flushed to non-volatile storage (disk). In the early 1980s, in-memory database research assumed that memory data would not be lost, relying on technologies such as battery-backed memory (usable for a period after a power failure) and non-volatile memory (NVM), but these technologies are still far from universal, so persistent storage must still be considered.

The persistence subsystem must have minimal impact on performance, affecting neither throughput nor latency. To reduce the impact on transaction execution, the first goal an in-memory database pursues is recovery speed, that is, completing recovery as fast as possible after a failure. Serial recovery cannot meet this speed requirement, so much current research focuses on parallel logging for in-memory databases. Parallelism, however, complicates the implementation of locks and Checkpoints.

For In-Memory Checkpoints, the implementation mechanism is usually closely integrated with concurrency control; the concurrency control design determines how the checkpoint is implemented. The first requirement of an ideal in-memory checkpoint is that it must not affect normal transaction execution, and it must not introduce extra latency or occupy too much memory.

Checkpoint types: Checkpoints are divided into two types, Consistent Checkpoints and Fuzzy Checkpoints. A consistent checkpoint contains no uncommitted transactions in the generated checkpoint data; any that exist are removed while the checkpoint is taken. A fuzzy checkpoint contains both committed and uncommitted transactions, and the uncommitted ones are removed only during recovery.

Checkpoint mechanisms: there are two kinds. The first implements checkpoints using the database system's own capabilities, for example using multi-version storage to take a snapshot. The second works at the operating-system level: the fork function copies a child of the process, which duplicates all the data in memory, but extra work is then needed to roll back the in-flight modifications that had not yet committed.
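
As a sketch of the second, OS-level mechanism (the approach HyPer made well known, see reference 5): the child process created by fork sees a copy-on-write snapshot of memory and can write it to disk while the parent keeps executing transactions. POSIX-only and illustrative:

```cpp
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

std::vector<int> table = {1, 2, 3, 4}; // stand-in for the in-memory data

int main() {
    pid_t pid = fork();
    if (pid == 0) {              // child: sees a frozen copy-on-write image
        FILE* f = std::fopen("checkpoint.bin", "wb");
        std::fwrite(table.data(), sizeof(int), table.size(), f);
        std::fclose(f);
        _exit(0);
    }
    table[0] = 99;               // parent keeps updating; the child's
                                 // snapshot is unaffected (copy-on-write)
    waitpid(pid, nullptr, 0);
    std::printf("checkpoint written; parent table[0]=%d\n", table[0]);
}
```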

Checkpoint content: there are two types in terms of content. One copies the data in full each time; the other, incremental checkpointing, copies only the changes made since the previous checkpoint. The difference between the two is the trade-off between the amount of data written at checkpoint time and the amount of data needed during recovery.

Checkpoint frequency: there are three approaches. One checkpoints periodically based on time; another checkpoints based on a fixed log size; the last is the forced checkpoint, for example a mandatory checkpoint before the database goes offline.

— Summary of this article —
In these two articles, we introduced the recovery subsystem of database systems: the concepts and technical implementations of the mainstream recovery approaches based on Physical Logging (ARIES) and Logical Undo Logging, as well as the recovery strategies of two database systems, SQL Server's CTR fixed-time recovery and the in-memory database Silo's Force Recovery. The next article will discuss the concurrency control technology of database systems.

References:

  1. C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz. 1992. ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking And Partial Rollbacks Using Write-Ahead Logging. ACM Trans. Database Syst. 17, 1 (March 1992), 94–162.

  2. Antonopoulos, P., Byrne, P., Chen, W., Diaconu, C., Kodandaramaih, R. T., Kodavalla, H., ... & Venkataramanappa, G. M. (2019). Constant time recovery in Azure SQL database. Proceedings of the VLDB Endowment, 12(12), 2143-2154.

  3. Zheng, W., Tu, S., Kohler, E., & Liskov, B. (2014). Fast databases with fast durability and recovery through multicore parallelism. In 11th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 14) (pp. 465-477).

  4. Ren, K., Diamond, T., Abadi, D. J., & Thomson, A. (2016, June). Low-overhead asynchronous checkpointing in main-memory database systems. In Proceedings of the 2016 International Conference on Management of Data (pp. 1539-1551).

  5. Kemper, A., & Neumann, T. (2011, April). HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In 2011 IEEE 27th International Conference on Data Engineering (pp. 195-206). IEEE.
