In-depth understanding of MySQL locks and MVCC

Basics

1 Locks explained (interview essentials): row locks and table locks; the difference and connection between optimistic and pessimistic locks

2 The MVCC mechanism explained (how the various isolation levels are achieved): database foundations (4), the InnoDB MVCC implementation principle

After reading the two articles above, you may still have plenty of questions. Here are the ones I worked through myself:

1 How does the RR level prevent phantom reads?

"RR" is the abbreviation of "Repeatable Read" (repeatable read), which is one of the four standard isolation levels of the database, and the other three are "Read Uncommitted" (read uncommitted), "Read Committed" (read committed) ) and "Serializable". Different isolation levels provide different concurrency control, which can make a trade-off between "data consistency" and "concurrency performance".

The Repeatable Read (RR) isolation level ensures that reading the same data multiple times within a transaction returns consistent results; that is, a transaction at this level does not see other transactions' modifications. This prevents the "non-repeatable read" problem, where a transaction sees different values when it reads the same row at two different points in time.

However, even at the RR isolation level, the "phantom read" problem can still occur. A phantom read is an inconsistency between the results of an earlier and a later execution of the same query within one transaction. The inconsistency is caused not by the values of the records being changed by other transactions, but by other transactions inserting or deleting records, so that "phantom" rows appear.

The most thorough way to prevent phantom reads is to use the highest isolation level, Serializable. At that level, transactions execute completely serially, so phantom reads naturally cannot occur. But this isolation level comes with a big penalty in concurrency performance.

In practice, some database systems (such as MySQL's InnoDB storage engine) use "Next-Key Locking" to prevent phantom reads at the RR level. Next-Key Locking is a locking strategy that locks not only the index records the query accesses but also the "gaps" between index records (gap locking), which prevents other transactions from inserting new records into those gaps and thus avoids phantom reads.

Note: the Next-Key Locking strategy can increase lock contention and thus reduce concurrency performance. In practice, its use must be weighed against the actual situation.

2 Next-key locks and gap locks

The difference between the two

Both Next-Key Locking and Gap Locking are techniques used by database management systems (such as MySQL's InnoDB engine) to prevent phantom reads under the Repeatable Read (RR) isolation level. Both involve a kind of "range locking": locking not only the records actually accessed but also the "gaps" between records. The main difference between the two is the granularity of locking.

  1. Next-Key Locking: Next-Key Locking is InnoDB's default row locking mode. It is a combination lock: a record lock on the index record itself plus a gap lock. Specifically, if a transaction T1 accesses an index record R, the next-key lock covers both R itself and the gap immediately before R. This not only prevents other transactions from modifying R, but also prevents them from inserting new records into that gap. Next-Key Locking can therefore effectively prevent phantom reads.

  2. Gap Locking: Gap Locking locks only the gaps between index records, not the records themselves. For example, if a transaction T1's query covers the range between index records R1 and R2, a gap lock locks the gap between R1 and R2 but not R1 or R2 themselves. Other transactions can still modify the contents of R1 and R2, but cannot insert new records between them. Gap Locking is mainly used to achieve read consistency and prevent phantom reads; it does not by itself prevent non-repeatable reads.

Next-Key Locking is more conservative: it provides stronger consistency guarantees but may reduce concurrency. Gap Locking is more lenient: it allows concurrent modification of existing records while still preventing phantom reads. In practice, the appropriate locking strategy should be chosen according to the specific requirements.
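To make the difference concrete, here is a small Python sketch (a toy model, not InnoDB internals) of which operations each lock type blocks, given index records 10, 20, 30 and a lock taken around record 20:

```python
# Toy model of the two lock granularities described above.
# Interval notation: a gap lock on (10, 20) blocks only inserts into the gap;
# a next-key lock on (10, 20] additionally locks record 20 itself.

def gap_lock_blocks(op, key):
    """Gap lock on the open interval (10, 20)."""
    if op == "insert":
        return 10 < key < 20          # inserts into the gap are blocked
    return False                      # existing records stay modifiable

def next_key_lock_blocks(op, key):
    """Next-key lock on the half-open interval (10, 20]."""
    if op == "insert":
        return 10 < key < 20          # same gap protection as a gap lock
    if op == "update":
        return key == 20              # plus a record lock on 20 itself
    return False

print(gap_lock_blocks("insert", 15))       # True  - phantom insert blocked
print(gap_lock_blocks("update", 20))       # False - record itself not locked
print(next_key_lock_blocks("update", 20))  # True  - record lock included
```

Both variants block the phantom insert at key 15; only the next-key lock also blocks updates to record 20, which is exactly the "more conservative" behavior described above.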

Next-key locks can "perfectly" solve the phantom read problem. The interviewer asks: is it really perfect?

Summary: perfect for safety, flawed for performance

"Perfect" might be an overly categorical statement, because even using a key lock doesn't solve all concurrency problems, and it introduces additional overhead.

The next-key lock is the lock InnoDB uses to prevent phantom reads under the repeatable read isolation level. It prevents new records from being inserted into a range, which does indeed stop phantom reads. However, this does not solve all concurrency issues. For example, next-key locks do not by themselves protect against the "write skew" problem, which can occur when two transactions both read the same data and then try to modify data based on what they read.

In addition, next-key locks increase locking overhead, possibly reducing performance. A next-key lock covers a range of records rather than a single record, which may block other concurrent transactions from accessing that range even when those transactions do not directly conflict with the locked record.

Therefore, although next-key locks effectively solve the phantom read problem, whether they can be called "perfect" depends on weighing their impact on concurrency and performance.

Is repeatable read guaranteed by row locks?

The Repeatable Read (RR) guarantee does not rely solely on row locks. In a real database system, the RR isolation level is usually implemented by combining several techniques, including row locks, gap locks, and multi-version concurrency control (MVCC).

Row locks can prevent the same row from being modified concurrently by other transactions, but they cannot prevent the "phantom read" problem. A phantom read means that within a transaction, two executions of the same query return inconsistent results, not because the queried records themselves were modified, but because new records were inserted or old records deleted.

In order to solve the problem of phantom reading, many database systems use gap locks (Gap Lock). Gap locks can lock the "gap" between two index records, preventing other transactions from inserting new records in this gap.

Multi-version concurrency control (MVCC) is another important technology, which allows multiple read transactions to access the same row of data at the same time without blocking each other. In MVCC, each write operation creates a new version of the data row, and read operations access an appropriate old version. This way, each transaction can run in a consistent snapshot as if it were the only one running.

Therefore, the repeatable read isolation level is usually implemented jointly by several techniques: row locks, gap locks, MVCC, and so on.

A question about the MVCC article

Original text: MVCC saves historical versions of the data and decides, by comparing version numbers, whether a version should be shown. It achieves transaction isolation without taking read locks: data can be modified while it is being read, and read while it is being modified, which greatly improves transaction concurrency.

Question: if data really can be read while being written and written while being read, are there no concurrency problems? For example, two requests arrive at the same time, one read and one write, and the write executes first. Should the read operate on the current version of the data, or wait until the write operation completes?

Answer: Your understanding of Multi-Version Concurrency Control (MVCC) is correct. MVCC keeps historical versions of the data and selects the appropriate version at read time, so that read operations and modifications can proceed at the same time, improving transaction concurrency.

Regarding the question you raised, "if data is read while being written and written while being read, are there no concurrency problems?", the MVCC model in fact has rules and strategies to handle exactly this situation.

In MVCC, each transaction gets a unique transaction ID (which also serves as a version number) when it begins. When a transaction reads a data item, it sees the latest version whose version number is no greater than its own transaction ID. When a transaction modifies a data item, it creates a new version whose version number is its own transaction ID.

So, for your question: suppose there are two transactions, T1 (write) and T2 (read), where T1's transaction ID is smaller than T2's. If T1 executes first, it creates a new data version whose version number is T1's transaction ID; when T2 then executes, it sees the latest version whose version number is not greater than T2's transaction ID, which is the version created by T1.

In this way, even when read and modify operations run concurrently, there is no data consistency problem, because each transaction runs against a consistent data snapshot. In this mode, reads do not block writes and writes do not block reads, achieving high concurrency.
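The version-selection rule described above can be sketched in a few lines of Python (a simplified model using bare transaction IDs as version numbers, not InnoDB's actual read-view mechanism, which is covered later):

```python
# Simplified MVCC: each write appends a version stamped with the writer's
# transaction ID, and a reader sees the newest version whose ID <= its own.

versions = []  # list of (trx_id, value), appended in write order

def write(trx_id, value):
    versions.append((trx_id, value))

def read(trx_id):
    visible = [(t, v) for t, v in versions if t <= trx_id]
    return max(visible)[1] if visible else None  # newest visible version

write(1, "v1")   # T1 (ID 1) writes first
write(3, "v3")   # T3 (ID 3) writes later
print(read(2))   # T2 (ID 2) sees only T1's version -> v1
print(read(4))   # T4 (ID 4) sees T3's version      -> v3
```

Note how T2's read returns T1's version even though T3 has already written a newer one: readers never block on writers, they just select an older version.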

Note, however, that while MVCC improves concurrency performance, it cannot solve all concurrency problems. For example, it does not solve the problem of two transactions modifying the same data item at the same time; that usually requires locks or other concurrency control mechanisms.

Does transaction T1 modify the version of the data it holds? If so, could the data read by T2 have been changed?

In the MVCC (Multi-Version Concurrency Control) mechanism, when a transaction performs a write (such as updating a data item), it creates a new version of that data item rather than modifying the original data in place (in fact, the new version occupies the data row itself while the old version is moved into the undo log; see the demo in section 2.4 of the MVCC article). The new version's number equals the writing transaction's ID. The original version is preserved so that other transactions still using the old version can continue to access it.

Suppose there are two transactions T1 and T2, T1 is a write transaction, T2 is a read transaction, their transaction IDs are ID1 and ID2, and ID1 < ID2. When T1 wants to modify a data item, it creates a new version of the data with version number ID1. When T2 wants to read this data item, it will see the latest version whose version number is not greater than ID2. Because ID1 < ID2, T2 can see the new version created by T1.

However, which version T2 sees depends on when T1 and T2 started and on the MVCC implementation strategy. For example, if T1 has not committed when T2 starts, T2 may only see the data version from before T1's modification; if T1 has already committed when T2 starts, T2 may see the version after T1's modification.

Note that what MVCC ensures is a consistent view for each transaction: within a transaction, the data it sees stays consistent regardless of whether other transactions modify it. For concurrent reads and writes, whether conflicts occur depends on the concrete concurrency control strategy and isolation level; in some cases, locks or other synchronization mechanisms are still needed to prevent concurrency conflicts.

How to understand "what MVCC ensures is a consistent view for transactions, that is, within a transaction, the data it sees is always consistent regardless of modifications by other transactions"?

"What MVCC can ensure is a consistent view of transactions, that is, in a transaction, regardless of whether the data is modified by other transactions, the data it sees is always consistent." This sentence means: In multi-version concurrency control Under the (MVCC) mechanism, each transaction sees a consistent and unchanged data snapshot, which depends on the time point when the transaction starts. This is the so-called "consistency view".

How is this achieved? When a transaction starts, it gets a unique transaction ID, which also represents the point in time at which it started. When the transaction reads a data item, it sees the latest version with a version number no greater than its own transaction ID. That is, the transaction only sees data versions that existed when it started, not new versions created afterwards by other transactions. Even if other transactions modify the data while the transaction is executing, the data it sees does not change.

For example, suppose transaction T1 starts at time point 1, and the version of data item A it reads at this time is 1. During the execution of T1, another transaction T2 modifies data item A at time point 2, creating version 2. However, even though the actual version of data item A has changed to 2, T1 still sees version 1 of the data, because version 1 was the version that existed when T1 started. Therefore, no matter how the data changes, the data seen by T1 is always consistent.

This property allows reads not to block writes and writes not to block reads under the MVCC mechanism, improving concurrency performance. At the same time, since each transaction runs against a consistent data snapshot, transaction consistency is also guaranteed.

Under a plain locking mechanism, with write operation T1 and read operation T2: if T1 executes first, must T2 block and wait until T1 finishes before it can read the updated data?

Under a locking mechanism, if a transaction (T1) acquires a write lock on a data item, all other transactions (such as T2) that try to read or write that item are blocked until T1 releases the lock. So in this case, T2 does indeed wait for T1 to finish executing and release the lock before reading the updated data.
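A minimal Python sketch of this blocking behavior, using a plain mutex to stand in for the database's write lock (a simulation, not a real database client):

```python
import threading
import time

balance = {"value": 100}
lock = threading.Lock()   # stands in for the write lock on the data item
events = []

def writer():             # T1: holds the lock while "the transaction" runs
    with lock:
        events.append("T1 locked")
        time.sleep(0.2)   # simulate work inside the transaction
        balance["value"] = 200
        events.append("T1 done")

def reader():             # T2: starts later and must wait for the lock
    time.sleep(0.05)      # ensure T1 already holds the lock
    with lock:            # blocks here until T1 releases
        events.append(f"T2 read {balance['value']}")

t1 = threading.Thread(target=writer)
t2 = threading.Thread(target=reader)
t1.start(); t2.start()
t1.join(); t2.join()
print(events)  # T2 only reads after T1 finished, so it reads 200
```

T2 never sees the intermediate state: it blocks until T1 releases the lock, then reads the updated value, which is exactly the waiting behavior described above.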

This blocking waiting situation may lead to a decrease in concurrency performance, especially in a high-concurrency system. To solve this problem, one way is to use a non-blocking concurrency control mechanism, such as MVCC (Multi-Version Concurrency Control). In MVCC, read operations do not block write operations, and write operations do not block read operations, thus improving concurrency performance.

However, although MVCC can improve concurrency performance, it also has some limitations and overhead, such as the need to maintain multiple versions of data, and the need to deal with version conflicts. Therefore, in practical applications, an appropriate concurrency control strategy should be selected according to specific needs and environments.

If the write transaction T1 commits partway through its execution, and T2 reads the data again after T1 ends, does it read the post-commit or the pre-commit data?

In the multi-version concurrency control (MVCC) model, a transaction obtains a transaction ID when it starts, which also acts as a timestamp representing the start time. All of that transaction's reads see only data versions from before its start time (that is, with version numbers no greater than its transaction ID).

If the write transaction T1 commits during execution, its modification of the data will generate a new version of the data. The version number of this new version will be equal to T1's transaction ID.

For the read transaction T2: if it started before T1 committed, then even after T1 commits a new data version, T2 still sees only the data versions from the moment it started, that is, the data from before T1's commit.

However, if T2 starts after T1 commits, then when T2 reads the data it sees the new version committed by T1, because that version already existed before T2 started.

Therefore, whether T2 reads the post-commit or the pre-commit data depends on whether T2 started before or after T1's commit. This is how the MVCC model provides a consistent view.

Question about read view

Material:

2.5 Read view matching conditions:

1. If data transaction ID < up_limit_id, the data is visible

If the data transaction ID is less than the smallest active transaction ID in the read view, the data certainly existed before the current transaction started, so it can be shown.

2. If data transaction ID >= low_limit_id, the data is not visible

If the data transaction ID is greater than the largest transaction ID of the system recorded in the read view, the data was generated after the read view was created, so it is not shown.

3. If up_limit_id <= data transaction ID < low_limit_id, check against the active transaction set trx_ids

If the data transaction ID falls between the smallest active transaction ID and the largest transaction ID of the system, the data may not yet have been committed when the current transaction started.

So in this case we need to check the data's transaction ID against the active transaction set trx_ids in the current read view:

Case 1: if the transaction ID is not in the trx_ids set (meaning that transaction had already committed when the read view was generated), the data can be shown.


Case 2: if the transaction ID is in trx_ids, the data had not been committed when the read view was generated. But if the data's transaction ID equals creator_trx_id, the data was generated by the current transaction itself, and a transaction can see its own changes, so in this case the data can also be shown.

Case 3: if the transaction ID is in trx_ids and is not equal to creator_trx_id, the data had not been committed when the read view was generated and was not generated by the current transaction, so the data cannot be shown.

4. When the read view conditions are not satisfied, the data is fetched from the undo log

When the data's transaction ID does not satisfy the read view conditions, the historical version of the data is fetched from the undo log, and that historical version's transaction ID is checked against the read view conditions in turn, until a qualifying historical version is found, or an empty result is returned if none exists.

How should the "data transaction ID" here be understood? Does it refer to the current read transaction's ID? Why isn't the transaction number the largest? Shouldn't transaction IDs be auto-incrementing?

In MySQL's InnoDB engine, transactions use the multi-version concurrency control (MVCC) mechanism. Under this mechanism, each data row version has an associated transaction ID identifying the transaction that created it. When a transaction modifies a row, it creates a new version of the row whose transaction ID is that transaction's ID. So the "data transaction ID" is the ID of the transaction that created this row version.

For your first question: the "data transaction ID" is not the ID of the current read transaction, but the ID of the transaction that created the current version of the data row. The InnoDB engine does use an auto-incrementing transaction ID, but note that the ID each transaction obtains is the maximum at the moment it starts, and transactions that start later get larger IDs. So if a data row's transaction ID (the ID of the transaction that created it) is less than the smallest active transaction ID in the read view, we can be certain this row version was created before the current transaction started, and it can therefore be seen by the current transaction.

Does the "data" here refer to the current latest version of the record stored in the table?

For your second question: "the data" refers to the version of the data row that the current read transaction is trying to access. This may not be the latest version, because under MVCC a read transaction may see an older version, depending on when the read transaction started and on the contents of its read view. This is how MVCC provides a consistent view.

In the following case, could the data transaction ID belong to a write that executed and committed earlier?

Data transaction ID >= low_limit_id: the data is not visible

If the data transaction ID is greater than the largest transaction ID of the system recorded in the read view, the data was generated after the read view was created, so it is not shown.

In the case you mention, the "data transaction ID" may indeed belong to a committed write transaction. If the data row's transaction ID is greater than or equal to the read view's low_limit_id, that row version was created after the current read transaction started. So even if the write transaction has committed, this data version is still not visible to the current read transaction, because under MVCC each transaction sees only the data versions from at or before its start time.

At this point, are the transactions in the read view's active transaction list all read transactions, or can there be write transactions too?

The active transaction list recorded in the read view includes all transactions that were uncommitted when the read view was created, whether they are read transactions or write transactions. A write transaction may later modify data and create new versions, but those new versions will not be visible to the current read transaction. A read transaction will not modify data, but in order to prevent phantom reads it still needs to be included in the active transaction list.

Why does putting read transactions into the read view's active transaction list help prevent phantom reads?

Phantom reads are a concurrency control problem: one transaction reads some rows, another transaction inserts new rows, and when the first transaction reads again it finds new "phantom" rows. With the read transaction recorded in the read view's active transaction list, other concurrent read transactions cannot see the uncommitted changes of write transactions, so within their consistent view they will not see new "phantom" rows, preventing the phantom read problem. This follows from the MVCC rule: a transaction can only see version data whose version number satisfies its read view (no greater than the current transaction's view of committed IDs).

Why can the transaction ID of an updated row still appear in the active transaction list? Since the row has been updated, doesn't that mean the transaction has already committed?

First of all, the "updated row" here only means that the old version of the data has been moved to the undo log and the new version occupies the current data row. It looks updated, but the updating transaction has not yet committed, so its ID may still appear in the read view's active transaction list.

Disadvantages of MVCC

What is write skew?

Write skew is a more complex phenomenon (a concurrent-write problem): two or more transactions read the same data at the same time and then each makes modifications based on what it read. This can leave the database in an inconsistent state: both transactions read the same data, both modify based on the read value, and conflicts and data inconsistencies result. Solving write skew may require a stronger concurrency control mechanism, such as optimistic locking or pessimistic locking.

For example: suppose there are two seats, and two users (transaction A and transaction B) each book one seat; the seat count should end at 0. A and B both read the remaining seat count (2) at the same time, then each decides to book one seat, each believing one seat will remain after its booking. But when both transactions commit, since each subtracts one seat from 2, the result is 1 seat instead of 0. This is the so-called write skew problem.

What about Read Skew?

Read skew: within the same transaction, reading the same data item multiple times returns different results. This is the "non-repeatable read" problem mentioned above.

What are the flaws of MVCC?

MVCC cannot solve concurrency problems among multiple write transactions (it cannot solve the write skew problem); it only addresses concurrency between read and write transactions.

Why can't MVCC solve concurrency problems between read-write compound transactions, or between write transactions?

Types of transactions

Transactions can be read-only, write-only, or read-write compound. Generally, MVCC can only solve concurrency problems between read-only and write-only transactions, or between read-only transactions; it cannot solve concurrency problems involving compound transactions, or between write-only transactions.

For transactions containing both read and write operations, MVCC still handles the reads well. But if a write operation depends on an earlier read within the same transaction, problems can arise. This is the so-called "read-write skew".

How to solve MVCC's write skew problem, or write-write transaction conflicts?

1 Execute transactions serially (the Serializable isolation level), at the cost of very low concurrency.
2 Use optimistic locking with a version number in the database, so that only one of the conflicting transactions succeeds.
3 Use a distributed lock, so that only one transaction at a time is allowed to modify the data row or carry out the related business operation.
4 Pessimistic locking: assume that concurrent transactions will cause data inconsistency, and lock before modifying data to prevent concurrent modification by other transactions. In the bank account example, the account could be locked between checking the balance and changing it. One transaction must complete before another can proceed, which avoids the write skew problem, but pessimistic locking can degrade concurrency performance.

Application scenarios of MVCC

MVCC, like other locks, can only guarantee transaction isolation; it cannot guarantee the correctness of business logic that depends on the data read, right?

Yes. MVCC only guarantees transaction isolation, that is, concurrently executing transactions do not interfere with each other. It does not guarantee dependencies that exist at the business level; those must be handled in the business logic. For example, in a flash sale scenario, even with MVCC, techniques such as optimistic or pessimistic locking may still be needed to ensure the product is not oversold.

Snapshot read vs. current read: which should be used in a seckill (flash sale) scenario?

In the seckill scenario, we usually need to guarantee inventory correctness, so we need to know the inventory state in real time, that is, we need a "current read". Snapshot reads provide high concurrency, but because they are based on a data snapshot at some point in time, they may not reflect the real-time inventory state, so they are less suitable for real-time-sensitive scenarios such as seckill.

Snapshot reads may return old data. Is that a correctness problem?

The data read from a snapshot may be stale, which can cause problems in some scenarios. Taking a flash sale as an example: if we place orders based on inventory read from a snapshot, overselling can occur. Suppose the actual inventory of an item is only 1, but the snapshot reads of several transactions all see inventory 1, so they all place orders, and more items are sold than actually exist. That is a correctness problem.

However, this does not mean snapshot reads are useless. In scenarios with low real-time requirements but high concurrency requirements, snapshot reads perform better. For example, when computing a user's purchase history, the freshness of the data matters less, and snapshot reads provide better concurrency performance.

Can MySQL be configured to use snapshot reads or current reads, and at what granularity: per table or per database?

In MySQL, you choose between snapshot reads (also known as consistent reads) and current-read behavior on a per-transaction basis, via the transaction isolation level. For example, under the READ COMMITTED isolation level, each query reads the latest committed data (current read). Under the REPEATABLE READ isolation level, a data snapshot is created at the start of the transaction, and all subsequent queries read from that snapshot (snapshot read). This setting is per transaction, not per table or per database.


Origin blog.csdn.net/yxg520s/article/details/131817634