Four Characteristics of MySQL Transactions and Their Implementation

(1) Brief description

Transactions are an important feature that distinguishes MySQL from NoSQL systems, and a key technology for ensuring data consistency in relational databases. A transaction can be regarded as the basic execution unit of database operations, and may contain one or more SQL statements: when the transaction executes, these statements either all take effect or none of them do. A MySQL transaction has four characteristics:

  • Atomicity: the statements in a transaction either all execute or none execute at all. This is the core characteristic of a transaction; the transaction itself is defined by atomicity. It is implemented mainly through the undo log.
  • Durability: ensures that once a transaction is committed, its data will not be lost because of a crash or similar failure. It is implemented mainly through the redo log.
  • Isolation: ensures that a transaction's execution is affected as little as possible by other concurrent transactions. InnoDB's default isolation level is REPEATABLE READ (RR), implemented mainly through the lock mechanism, hidden columns in each row, the undo log, and next-key locks.
  • Consistency: the ultimate goal a transaction pursues. Achieving consistency requires guarantees at both the database level and the application level.

Let's take InnoDB as an example and walk through these four characteristics and how each is implemented.

(2) Atomicity

The atomicity of a transaction is like an atomic operation: the transaction cannot be subdivided, and the operations within it either all happen or none happen. If any SQL statement in the transaction fails, the statements that have already executed must be rolled back, returning the database to the state it was in before the transaction started. Atomicity means the transaction is a single unit of work.

Atomicity is implemented through the undo log. When a transaction needs to be rolled back, the InnoDB engine uses the undo log to undo the effect of each SQL statement and roll the data back. So what is the undo log?

The undo log is a log provided by the InnoDB engine; as the name implies, it is used to roll back data. When a transaction modifies the database, InnoDB not only records the redo log (described later) but also generates the corresponding undo log. If the transaction fails or ROLLBACK is called, the information in the undo log can be used to roll the data back to its state before the modification.

Unlike the redo log, the undo log is a logical log: it records information about the SQL statements that were executed, and on rollback InnoDB performs the opposite operation for each record. For every insert, a delete is performed during rollback; for every delete, an insert; and for every update, a reverse update that restores the previous values. The undo log serves two purposes: providing rollback, and implementing MVCC.
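
As a minimal sketch of atomicity in action (the `accounts` table and the amounts are hypothetical), the following transaction either applies both updates or neither:

```sql
-- Hypothetical schema: accounts(id INT PRIMARY KEY, balance INT).
START TRANSACTION;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- debit one row
UPDATE accounts SET balance = balance + 100 WHERE id = 2;  -- credit another

-- If anything went wrong, ROLLBACK makes InnoDB apply the undo log in
-- reverse (a DELETE for each INSERT, an INSERT for each DELETE, a reverse
-- UPDATE for each UPDATE), restoring the pre-transaction state.
ROLLBACK;

-- Had both statements succeeded, COMMIT would make the changes permanent.
```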

(3) Durability

The durability of a transaction means that once the transaction is committed, its changes to the database are permanent rather than temporary: no later operation, or even a system crash, can affect the result of the committed transaction. In InnoDB, durability is implemented through the redo log.

The redo log is an InnoDB-engine-level log that records the data changes made by transactions as physical modifications to data pages. When InnoDB updates data, it first writes the update record to the redo log, and only later, when the system is idle or according to the configured flush strategy, writes the changed pages to disk. This is the well-known Write-Ahead Logging (WAL) technique, which greatly reduces the frequency of I/O operations and improves the efficiency of flushing data.

The redo log has some details worth noting. Its size is fixed, so in order to keep writing new records continuously, two positions are maintained within it: write_pos, the position where new records are written, and checkpoint, the position up to which old records have been erased (their pages safely flushed). The structure behaves like a circular queue.

The space between write_pos and checkpoint is available for new records; writing and erasing chase each other around the ring. When write_pos catches up with checkpoint, the redo log is full and no new update statements can be executed: MySQL must stop and erase some records first by executing the checkpoint rule, flushing dirty data pages and dirty log pages from the buffer to disk (dirty pages are pages in memory that have not yet been flushed to disk) to free up writable space.

Any discussion of the redo log also has to mention the buffer pool, an area of memory that holds a mapping of some of the data pages on disk and serves as a cache for database access. On a read request, InnoDB first checks whether the page is in the buffer pool; on a miss, it reads the page from disk into the pool. On a write request, the change is made in the buffer pool first, and the modified pages are periodically flushed to disk. Pages modified in memory but not yet written back are called dirty pages, and writing them back is called flushing.

When data is modified, in addition to updating the page in the buffer pool, the operation is also recorded in the redo log. When the transaction commits, the redo log is flushed according to the configured policy. If MySQL crashes, the redo log can be replayed on restart to restore the database, which is what guarantees durability and makes the database crash-safe.

In addition to flushing dirty pages as described above, writing the redo log itself also involves moving records from memory to disk in order to make the log durable. The redo log has two parts: the redo log buffer, cached in volatile memory, and the redo log file, stored on disk.
To guarantee that every record reaches the log file on disk, each write of the redo log buffer to the redo log file calls the operating system's fsync (declared in the UNIX header <unistd.h>, fsync forces all modified in-memory data of a file to be written to the storage device). Along the way, the data also passes through the os buffer in the operating system's kernel space.
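
This flushing behavior can be inspected through standard InnoDB system variables. A small sketch (the values shown depend on your server; note that MySQL 8.0.30+ consolidates the redo log size into innodb_redo_log_capacity):

```sql
-- Size and number of the redo log files that form the circular buffer.
SHOW VARIABLES LIKE 'innodb_log_file_size';
SHOW VARIABLES LIKE 'innodb_log_files_in_group';

-- innodb_flush_log_at_trx_commit controls when the redo log buffer is
-- written to the file and fsync'ed:
--   1 (default): write + fsync on every commit -- full durability
--   2: write on commit, fsync roughly once per second -- an OS crash can
--      lose up to ~1 second of transactions
--   0: write + fsync roughly once per second -- fastest, least durable
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
```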


(4) Isolation

Atomicity and durability are properties of a single transaction itself, whereas isolation describes the relationship that should hold between transactions: the effects of different transactions must not interfere with each other, and each transaction's operations are isolated from the others'. Since a transaction may contain more than one SQL statement, other transactions are very likely to start executing while it is still in progress, so concurrent execution of multiple transactions requires that their operations be isolated from one another.

Isolation between transactions is achieved through the lock mechanism. When a transaction needs to modify a row in the database, it must lock that row first; other transactions are then not allowed to operate on the locked data and must wait until the current transaction commits or rolls back and releases the lock. Locks are not an unfamiliar concept: in many scenarios, various lock implementations are used to protect and synchronize data. In MySQL, locks can be divided into different types according to different classification criteria.

  • By granularity: row locks, table locks, page locks
  • By usage: shared locks, exclusive locks
  • By design philosophy: pessimistic locks, optimistic locks

Below we briefly describe the categories and characteristics of these locks in order:

1. Granularity (row lock, table lock, page lock)


From the perspective of granularity, a table lock locks the entire table when operating on data, so its concurrency is poor. A row lock locks only the data being operated on, giving good concurrency, but since locks themselves consume resources (acquiring, checking, and releasing locks all cost something), table locks can save a lot of resources when a large amount of data would otherwise be locked. A page lock sits between row-level and table-level locks in granularity: it locks a page at a time.

Different MySQL storage engines support different locks: MyISAM supports only table locks, while InnoDB supports both table locks and row locks, and for performance reasons uses row locks in most cases.

InnoDB row locks are implemented by locking index entries; if a table has no suitable index, InnoDB locks records through its hidden clustered index. In other words, if data is not retrieved through an index condition, InnoDB has to scan the whole table to find the record and ends up locking every row in it, with the same practical effect as a table lock.
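
A two-session sketch of this behavior (the table `t` and its data are hypothetical):

```sql
CREATE TABLE t (
  id   INT PRIMARY KEY,     -- indexed
  name VARCHAR(20)          -- not indexed
) ENGINE = InnoDB;

-- Session 1: locking through the index locks only the matching row.
START TRANSACTION;
SELECT * FROM t WHERE id = 1 FOR UPDATE;

-- Session 2: other rows remain writable.
UPDATE t SET name = 'b' WHERE id = 2;   -- proceeds immediately

-- Had Session 1 locked through the unindexed column instead:
--   SELECT * FROM t WHERE name = 'a' FOR UPDATE;
-- InnoDB would scan and lock every row, and Session 2's update would
-- block as if the whole table were locked.
```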

2. Usage (shared lock, exclusive lock)

A shared lock, also called a read lock or S lock, is, as the name implies, a lock that multiple transactions can hold on the same data at the same time; holders may read the data but not modify it.

An exclusive lock, also called a write lock or X lock, cannot coexist with any other lock: if a transaction acquires an exclusive lock on a row, no other transaction can acquire any lock on that row, whether shared or exclusive. The transaction holding the exclusive lock may both read and modify the data.

Note: for a plain select statement, InnoDB takes no locks at all, so multiple selects can run concurrently without any lock conflict. For insert, update, and delete, InnoDB automatically takes exclusive locks on the rows involved. Only for select do we need to set locks manually, as shown below.
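
The standard syntax for taking those locks manually inside a transaction, reusing the hypothetical table `t` from above:

```sql
START TRANSACTION;

-- Shared (S) lock: others may also read/S-lock the row, but not modify it.
SELECT * FROM t WHERE id = 1 LOCK IN SHARE MODE;   -- MySQL 8.0 also accepts FOR SHARE

-- Exclusive (X) lock: no other transaction may take any lock on the row.
SELECT * FROM t WHERE id = 1 FOR UPDATE;

COMMIT;   -- both locks are released here
```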

3. Design philosophy (pessimistic lock, optimistic lock)

Pessimistic locking (Pessimistic Concurrency Control), as the name suggests, takes a conservative attitude toward the possibility of data being modified by the outside world (including other transactions in the current system and transactions from external systems), so it keeps the data locked for the entire duration of processing. Pessimistic locking usually relies on the lock mechanism provided by the database itself; only a database-level lock can truly guarantee exclusive access, since a locking scheme implemented purely inside the application cannot prevent an external system from modifying the data. In MySQL, pessimistic locking is typically realized with select ... for update. The workflow is often summarized as: first lock, then check, then update.
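
A sketch of that "first lock, then check, then update" pattern, again using the hypothetical `accounts` table:

```sql
START TRANSACTION;

-- 1. Lock: acquire an exclusive lock on the row before doing anything.
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;

-- 2. Check: the application verifies the business condition on the locked
--    value (for example, balance >= 100).

-- 3. Update: safe, since no other transaction can touch the row meanwhile.
UPDATE accounts SET balance = balance - 100 WHERE id = 1;

COMMIT;   -- releases the lock
```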

Pessimistic concurrency control is essentially a conservative "lock before access" strategy that makes data processing safe. In terms of efficiency, however, managing locks adds overhead to the database and increases the chance of deadlock. Moreover, read-only transactions never conflict with each other, so locking them gains nothing and only increases system load. It also reduces parallelism: if one transaction locks a row, other transactions must wait for it to finish before they can process that row.

Optimistic locking (Optimistic Locking), in contrast, assumes that under normal circumstances the data will not conflict, so conflicts are only checked when an update is actually submitted; if a conflict is found, an error is returned to the user, who decides what to do next. Optimistic locking is usually implemented with a version identifier, as in MVCC.

Optimistic concurrency control assumes that data races between transactions are rare, so it proceeds directly and defers any check until commit, acquiring no locks and therefore never deadlocking. But this can still produce unexpected results: for example, if two transactions both read the same row, modify it, and write it back, the second write silently overwrites the first (the classic lost update), as the version-check sketch below illustrates.
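
A common application-level sketch of optimistic locking: a `version` column maintained by the application (not something InnoDB provides by itself) detects exactly this lost-update scenario:

```sql
-- 1. Read the row, remembering its version (say it returns version = 7).
SELECT balance, version FROM accounts WHERE id = 1;

-- 2. Write back only if the version is unchanged, bumping it atomically.
UPDATE accounts
SET    balance = balance - 100,
       version = version + 1
WHERE  id = 1 AND version = 7;

-- 3. If this reports 0 affected rows, another transaction updated the row
--    first; report the conflict and let the caller retry or give up.
```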

So why must we ensure the isolation of transactions? First, we need to understand the problems MySQL can run into under concurrency:

1. Dirty Read

If a transaction A writes data to the database but has not yet committed or aborted, and another transaction B can already see that data, B has performed a dirty read.


The biggest problem with dirty reads is that nonexistent data may be read. Suppose transaction B updates a row and transaction A reads the updated value, but B then rolls back, restoring everything it changed: the value A just read never existed in the database from any committed point of view. Transaction A has read data that does not exist, which is a serious problem.

How do we prevent dirty reads? Take an exclusive lock when modifying, and do not release it until the transaction commits; take a shared lock when reading, so that while the data is being read no other transaction may modify it (they may only read). If the reading transaction then needs to update, its shared lock is upgraded to an exclusive lock, and no other transaction may read or write the row at all. This prevents dirty reads.
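
A two-session sketch of the dirty read itself; it is only observable at the lowest isolation level (tables and values hypothetical):

```sql
-- Session A:
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

-- Session B:
START TRANSACTION;
UPDATE accounts SET balance = 0 WHERE id = 1;   -- not committed yet

-- Session A:
SELECT balance FROM accounts WHERE id = 1;      -- sees 0: a dirty read

-- Session B:
ROLLBACK;   -- the 0 that Session A read never officially existed

-- At READ COMMITTED or above, Session A would have seen the last
-- committed value instead.
```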

2. Non-repeatable Read

Transaction A reads the same data twice, but the two reads return different results. The difference between a dirty read and a non-repeatable read is that the former reads data other transactions have not yet committed, while the latter reads data other transactions have already committed.

In InnoDB, non-repeatable reads are prevented by MVCC. With MVCC, different transactions can read different versions of the same data at the same time, which solves both dirty reads and non-repeatable reads. MVCC uses hidden columns in each row together with the undo log to keep multiple versions of the data alive; the benefit is that reads under MVCC need no locks, avoiding conflicts between simultaneous reads and writes.

To implement MVCC, several hidden columns are stored with each row, such as the version at which the row was created, a deletion marker, and a rollback pointer into the undo log. The version here is not a wall-clock time but a system version number: each time a new transaction starts, the system version number is incremented, and its value at the start of the transaction serves as the transaction's own version, to be compared with the version of every row the transaction reads. Each transaction thus has its own version number, and these version comparisons are what implement data version control inside the transaction.
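
A sketch of MVCC as observed under REPEATABLE READ (InnoDB's default level); the table and values are hypothetical:

```sql
-- Session A:
START TRANSACTION;
SELECT balance FROM accounts WHERE id = 1;   -- returns, say, 100

-- Session B:
UPDATE accounts SET balance = 200 WHERE id = 1;
COMMIT;

-- Session A, same transaction:
SELECT balance FROM accounts WHERE id = 1;   -- still 100: InnoDB follows the
-- row's rollback pointer into the undo log and reconstructs the version
-- visible to this transaction's snapshot, without taking any read locks.
COMMIT;
```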

3. Phantom Read

Transaction A queries the database twice with the same condition, and the two results contain different numbers of rows; this phenomenon is called a phantom read. The difference between a non-repeatable read and a phantom read can be roughly stated as: in the former, the data itself has changed; in the latter, the number of rows has changed.

Phantom reads are prevented with the next-key lock mechanism. A next-key lock is a kind of row lock, but it locks not only the current index record itself but also a range around it; the range portion is a gap lock, which prevents other transactions from inserting or modifying records within that gap.
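
A two-session sketch under REPEATABLE READ, assuming the hypothetical table `t` currently contains rows with id 5 and 10:

```sql
-- Session A:
START TRANSACTION;
SELECT * FROM t WHERE id BETWEEN 5 AND 10 FOR UPDATE;
-- Next-key locks cover the matching records AND the gaps around them.

-- Session B:
INSERT INTO t (id, name) VALUES (7, 'x');   -- blocks: 7 falls in a locked gap

-- Because inserts into the gap must wait, a second identical SELECT in
-- Session A cannot return rows that were not there the first time.
```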

Although InnoDB uses next-key locks to avoid phantom reads, this is not true SERIALIZABLE isolation (one of MySQL's four isolation levels). Avoiding dirty reads, non-repeatable reads, and phantom reads is a necessary but not sufficient condition for serializability: SERIALIZABLE avoids all three, but avoiding all three does not by itself make execution serializable.

Now let's look at MySQL's four isolation levels. The SQL standard defines four levels, each with rules that limit which changes made inside and outside a transaction are visible and which are not. Lower isolation levels generally support higher concurrency and carry lower system overhead. Isolation levels exist precisely to address the three classes of concurrency problems above; the table below shows which problems each level avoids.

Isolation level     | Dirty read | Non-repeatable read | Phantom read
READ UNCOMMITTED    | possible   | possible            | possible
READ COMMITTED      | avoided    | possible            | possible
REPEATABLE READ     | avoided    | avoided             | possible
SERIALIZABLE        | avoided    | avoided             | avoided

In practical database design, the higher the isolation level, the lower the database's concurrency; but if the level is too low, the database runs into the assorted problems above during reads and writes. Most database systems therefore default to READ COMMITTED (e.g. Oracle) or REPEATABLE READ (MySQL's InnoDB engine). You can inspect and change the level as shown below.
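
Standard statements for inspecting and changing the isolation level:

```sql
-- Current level (MySQL 8.0; older versions expose @@tx_isolation).
SELECT @@transaction_isolation;

-- Change it for the current session only:
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Or globally for new connections (requires privileges):
SET GLOBAL TRANSACTION ISOLATION LEVEL REPEATABLE READ;
```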

(5) Consistency

Consistency means that after a transaction executes, the database's integrity constraints are not violated: the data is in a legal state both before and after the transaction. Consistency is the ultimate goal a transaction pursues; atomicity, durability, and isolation all exist to guarantee the consistency of the database state.

In other words, the A, I, and D of ACID are properties of the database itself, depending on its concrete implementation, while C also depends on the application layer, that is, on the developer. Consistency here means the system moves from one correct state to another correct state, where a state is correct if it satisfies the predetermined constraints. Saying that transactions have the C of ACID means that the A, I, and D of the transaction are employed to guarantee our consistency.

September 8, 2020

Origin: blog.csdn.net/weixin_43907422/article/details/108455391