MySQL deadlocks keep occurring in production: a textbook-style investigation and analysis

1. Log

1.1 Business log

Code that had been running smoothly for more than half a year suddenly started throwing deadlock exceptions in the last few days. During peak business hours, the following log appears on the business machines about once every two days:

 INFO 57553 --- [ConsumerThread2] org.example.controller.TestController    : Log for global trace id 2: [TransactionReqVO(userId=4, money=4), TransactionReqVO(userId=2, money=2), TransactionReqVO(userId=5, money=5)]
 INFO 57553 --- [ConsumerThread1] org.example.controller.TestController    : Log for global trace id 1: [TransactionReqVO(userId=5, money=5), TransactionReqVO(userId=1, money=1), TransactionReqVO(userId=4, money=4)]
ERROR 57553 --- [ConsumerThread2] org.example.controller.TestController    : Exception for global trace id 2:
### Error updating database.  Cause: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
### The error may exist in org/example/mapper/TestTableMapper.java (best guess)
### The error may involve org.example.mapper.TestTableMapper.update-Inline
### The error occurred while setting parameters
### SQL: UPDATE test_table SET money = money + ? WHERE user_id = ?
### Cause: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
; Deadlock found when trying to get lock; try restarting transaction; nested exception is com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction

org.springframework.dao.DeadlockLoserDataAccessException: 
### Error updating database.  Cause: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
### The error may exist in org/example/mapper/TestTableMapper.java (best guess)
### The error may involve org.example.mapper.TestTableMapper.update-Inline
### The error occurred while setting parameters
### SQL: UPDATE test_table SET money = money + ? WHERE user_id = ?
### Cause: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
; Deadlock found when trying to get lock; try restarting transaction; nested exception is com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
 at org.springframework.jdbc.support.SQLErrorCodeSQLExceptionTranslator.doTranslate(SQLErrorCodeSQLExceptionTranslator.java:266) ~[spring-jdbc-5.0.13.RELEASE.jar:5.0.13.RELEASE]
 at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:72) ~[spring-jdbc-5.0.13.RELEASE.jar:5.0.13.RELEASE]
 at org.mybatis.spring.MyBatisExceptionTranslator.translateExceptionIfPossible(MyBatisExceptionTranslator.java:73) ~[mybatis-spring-2.0.1.jar:2.0.1]
 at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:446) ~[mybatis-spring-2.0.1.jar:2.0.1]
 at com.sun.proxy.$Proxy59.update(Unknown Source) ~[na:na]
 at org.mybatis.spring.SqlSessionTemplate.update(SqlSessionTemplate.java:294) ~[mybatis-spring-2.0.1.jar:2.0.1]
 at org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:67) ~[mybatis-3.5.1.jar:3.5.1]
 at org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:58) ~[mybatis-3.5.1.jar:3.5.1]
 at com.sun.proxy.$Proxy62.update(Unknown Source) ~[na:na]
 at org.example.service.impl.TestServiceImpl.update(TestServiceImpl.java:16) ~[classes/:na]
 at org.example.manager.impl.BizManagerImpl.transactionMoney(BizManagerImpl.java:25) ~[classes/:na]
 at org.example.manager.impl.BizManagerImpl$$FastClassBySpringCGLIB$$824241b9.invoke(<generated>) ~[classes/:na]

The word Deadlock stands out: the business hit a deadlock, so this is a problem in the business code. However, that code has been running for more than half a year, and the Git history shows that nobody has touched it recently. The bug was probably there all along, and the conditions needed to trigger it have only recently been met.

A brief summary of the log:

1. What kind of error is this?

Line 8: ### SQL: UPDATE test_table SET money = money + ? WHERE user_id = ?
Line 9: ### Cause: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction

Lines 8-9 show that this is a database error: the transaction was rolled back because of a deadlock. The key SQL is: UPDATE test_table SET money = money + ? WHERE user_id = ?

2. Which method is at the core of the error, that is, which method starts the transaction?

Line 30: at org.example.manager.impl.BizManagerImpl.transactionMoney(BizManagerImpl.java:25) ~[classes/:na]
Line 31: at org.example.manager.impl.BizManagerImpl$$FastClassBySpringCGLIB$$824241b9.invoke(<generated>) ~[classes/:na]

After filtering out the JDK, Spring, and MyBatis classes, we are left with the core business frames (lines 30-31). Line 31 is Spring's proxy invocation, and line 30 is the first business method executed: BizManagerImpl.transactionMoney
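The filtering step can be sketched in a few lines of Java. This is only an illustration: the class name BusinessFrameFilter and the list of framework package prefixes are assumptions, not part of the original project. It keeps the "at ..." frames that do not start with a JDK, Spring, MyBatis, or MySQL driver package.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BusinessFrameFilter {
    // Package prefixes treated as framework/JDK noise (an assumed list, adjust as needed).
    private static final List<String> NOISE_PREFIXES = Arrays.asList(
            "java.", "javax.", "sun.", "com.sun.", "org.springframework.",
            "org.mybatis.", "org.apache.ibatis.", "com.mysql.");

    /** Returns the "at ..." frames that do not belong to a known framework package. */
    static List<String> businessFrames(List<String> stackTraceLines) {
        List<String> result = new ArrayList<>();
        for (String line : stackTraceLines) {
            String trimmed = line.trim();
            if (!trimmed.startsWith("at ")) {
                continue; // skip message lines such as "### SQL: ..."
            }
            String frame = trimmed.substring(3);
            boolean noise = false;
            for (String prefix : NOISE_PREFIXES) {
                if (frame.startsWith(prefix)) {
                    noise = true;
                    break;
                }
            }
            if (!noise) {
                result.add(frame);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> trace = Arrays.asList(
                " at org.mybatis.spring.SqlSessionTemplate.update(SqlSessionTemplate.java:294)",
                " at com.sun.proxy.$Proxy62.update(Unknown Source)",
                " at org.example.service.impl.TestServiceImpl.update(TestServiceImpl.java:16)",
                " at org.example.manager.impl.BizManagerImpl.transactionMoney(BizManagerImpl.java:25)");
        businessFrames(trace).forEach(System.out::println);
    }
}
```

Applied to the trace above, only the org.example frames survive (TestServiceImpl.update and the two BizManagerImpl frames), and transactionMoney is the entry point we care about.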

1.2 Database deadlock log

Next, look at the deadlock log of the corresponding database with the command show engine innodb status. With the unimportant parts filtered out, it looks like this:

------------------------
LATEST DETECTED DEADLOCK
------------------------
2020-07-14 23:34:29 0x7f958f1d5700
*** (1) TRANSACTION:
TRANSACTION 95146580, ACTIVE 2 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1136, 5 row lock(s), undo log entries 2
MySQL thread id 6264489, OS thread handle 140273305761536, query id 837446998 10.10.59.164 root updating
UPDATE test_table SET money = money + 5 WHERE user_id = 5
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 71816 page no 4 n bits 80 index idx_user_id of table `mall`.`test_table` trx id 95146580 lock_mode X locks rec but not gap waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 8; hex 8000000000000005; asc         ;;
 1: len 8; hex 8000000000000006; asc         ;;

*** (2) TRANSACTION:
TRANSACTION 95146581, ACTIVE 2 sec starting index read
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1136, 5 row lock(s), undo log entries 2
MySQL thread id 6264490, OS thread handle 140280327919360, query id 837446999 10.10.59.164 root updating
UPDATE test_table SET money = money + 4 WHERE user_id = 4
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 71816 page no 4 n bits 80 index idx_user_id of table `mall`.`test_table` trx id 95146581 lock_mode X locks rec but not gap
Record lock, heap no 3 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 8; hex 8000000000000005; asc         ;;
 1: len 8; hex 8000000000000006; asc         ;;

Record lock, heap no 5 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 8; hex 8000000000000001; asc         ;;
 1: len 8; hex 8000000000000002; asc         ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 71816 page no 4 n bits 80 index idx_user_id of table `mall`.`test_table` trx id 95146581 lock_mode X locks rec but not gap waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 8; hex 8000000000000004; asc         ;;
 1: len 8; hex 8000000000000005; asc         ;;

*** WE ROLL BACK TRANSACTION (2)

The key points are summarized as follows:

1. When did the last deadlock in this database occur?

Line 4: 2020-07-14 23:34:29 0x7f958f1d5700 shows that the last deadlock occurred at 2020-07-14 23:34:29.

2. What is the important information about the two transactions involved in the deadlock?

Line 12: RECORD LOCKS space id 71816 page no 4 n bits 80 index idx_user_id of table `mall`.`test_table` trx id 95146580 lock_mode X locks rec but not gap waiting. The lock that transaction 1 is waiting for is: lock_mode X locks rec but not gap waiting

Line 24: RECORD LOCKS space id 71816 page no 4 n bits 80 index idx_user_id of table `mall`.`test_table` trx id 95146581 lock_mode X locks rec but not gap. The lock that transaction 2 holds is: lock_mode X locks rec but not gap

Line 34: RECORD LOCKS space id 71816 page no 4 n bits 80 index idx_user_id of table `mall`.`test_table` trx id 95146581 lock_mode X locks rec but not gap waiting. The lock that transaction 2 is waiting for is: lock_mode X locks rec but not gap waiting

Line 39: *** WE ROLL BACK TRANSACTION (2). The transaction that was rolled back is transaction 2.

Lines 12, 24, and 34 all contain index idx_user_id of table `mall`.`test_table`, so the index involved in the deadlock is idx_user_id.

3. Can we tell which two SQL statements caused the deadlock?

No. Deadlocks come in many forms, and a transaction may contain more than two SQL statements. The deadlock log alone does not reveal the exact cause; you have to look at the transaction context together with the business code.
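The hex fields in the record dumps can be turned back into numbers. The sketch below relies on the fact that InnoDB stores signed integers big-endian with the sign bit flipped; the class and method names are made up for illustration, and it only handles 8-byte BIGINT values like the ones in this log.

```java
class InnodbHexDecoder {
    /**
     * Decodes a signed BIGINT as printed in the "hex ..." field of a record dump.
     * InnoDB flips the sign bit when storing signed integers, so flipping the
     * top bit back yields the original value.
     */
    static long decodeSignedBigint(String hex) {
        long raw = Long.parseUnsignedLong(hex, 16);
        return raw ^ Long.MIN_VALUE; // Long.MIN_VALUE is 0x8000000000000000
    }

    public static void main(String[] args) {
        System.out.println(decodeSignedBigint("8000000000000005")); // prints 5
        System.out.println(decodeSignedBigint("8000000000000006")); // prints 6
    }
}
```

Decoded this way, the record at heap no 3 is (5, 6), heap no 5 is (1, 2), and heap no 2 is (4, 5). Assuming the usual secondary index layout, the first value is the user_id and the second is the primary key id.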

2. Theoretical knowledge

During the investigation, one pattern stood out: the affected users were all large online users. Since I had not looked at deadlock theory for a long time, I first reviewed the basics.

2.1 Conditions for deadlock

  1. Mutual exclusion: a resource can be used by only one process at a time.
  2. Hold and wait: a process that is blocked while requesting a resource keeps holding the resources it has already acquired.
  3. No preemption: resources a process has acquired cannot be forcibly taken away before it is done with them.
  4. Circular wait: several processes form a cycle in which each waits for a resource held by the next.

Breaking a deadlock is also simple: break any one of the four conditions. (This case breaks condition 4.)
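One common way to break condition 4 (not necessarily the fix chosen in this article) is to make every transaction take its locks in the same order. The sketch below sorts the requests by userId before they are processed; TransferRequest is a hypothetical stand-in for the TransactionReqVO seen in the logs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class LockOrdering {
    /** Hypothetical stand-in for the request object seen in the logs. */
    static final class TransferRequest {
        final long userId;
        final long money;

        TransferRequest(long userId, long money) {
            this.userId = userId;
            this.money = money;
        }
    }

    /** Returns a copy sorted by userId so that every transaction locks rows in the same order. */
    static List<TransferRequest> inLockOrder(List<TransferRequest> requests) {
        List<TransferRequest> sorted = new ArrayList<>(requests);
        sorted.sort(Comparator.comparingLong(r -> r.userId));
        return sorted;
    }

    public static void main(String[] args) {
        List<TransferRequest> batch = Arrays.asList(
                new TransferRequest(5, 5), new TransferRequest(1, 1), new TransferRequest(4, 4));
        for (TransferRequest r : inLockOrder(batch)) {
            System.out.println("userId=" + r.userId + ", money=" + r.money);
        }
    }
}
```

With a consistent order, two batches that share user ids can still block each other briefly, but they can no longer wait on each other in a cycle.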

2.2 Database lock types

Database deadlocks are more complicated and are mainly caused by INSERT and UPDATE. (DELETE and SELECT ... FOR UPDATE are not considered here, because real business code generally contains neither: deletes are all logical deletes, and FOR UPDATE should be used sparingly unless you know it is required.) The InnoDB lock types are:

  1. Shared and Exclusive Locks (S, X)
  2. Intention Locks
  3. Record Locks
  4. Gap Locks
  5. Next-Key Locks
  6. Insert Intention Locks
  7. AUTO-INC Locks
  8. Predicate Locks for Spatial Indexes

This list follows the InnoDB lock classification in the official documentation. From lock_mode X locks rec but not gap in the deadlock log, we can roughly tell that an X lock and a record lock are involved, and that gap locks are not (note the "not").
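Reading these fields off a RECORD LOCKS line can be automated. The following is a minimal sketch that covers only the form seen in this log (lock_mode, optionally followed by "locks rec but not gap" and "waiting"); other variants of the line are not handled. LockLineParser and LockInfo are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LockLineParser {
    // Covers only the "lock_mode <mode> [locks rec but not gap] [waiting]" form seen in this log.
    private static final Pattern RECORD_LOCK = Pattern.compile(
            "index (\\S+) of table (\\S+) trx id (\\d+) lock_mode (\\S+)"
            + "( locks rec but not gap)?( waiting)?");

    static final class LockInfo {
        final String index;
        final String table;
        final long trxId;
        final String mode;
        final boolean recordOnly; // true when the line says "locks rec but not gap"
        final boolean waiting;    // true when the lock has not been granted yet

        LockInfo(String index, String table, long trxId, String mode,
                 boolean recordOnly, boolean waiting) {
            this.index = index;
            this.table = table;
            this.trxId = trxId;
            this.mode = mode;
            this.recordOnly = recordOnly;
            this.waiting = waiting;
        }
    }

    static LockInfo parse(String line) {
        Matcher m = RECORD_LOCK.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("not a recognized RECORD LOCKS line: " + line);
        }
        return new LockInfo(m.group(1), m.group(2), Long.parseLong(m.group(3)),
                m.group(4), m.group(5) != null, m.group(6) != null);
    }

    public static void main(String[] args) {
        LockInfo info = parse("RECORD LOCKS space id 71816 page no 4 n bits 80 index idx_user_id "
                + "of table `mall`.`test_table` trx id 95146580 lock_mode X locks rec but not gap waiting");
        System.out.println(info.index + " mode=" + info.mode
                + " recordOnly=" + info.recordOnly + " waiting=" + info.waiting);
    }
}
```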

3. Analysis from the deadlock log

Before the analysis, get the table definition with show create table test_table;:

CREATE TABLE `test_table` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `money` bigint(20) NOT NULL,
  `user_id` bigint(20) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_user_id` (`user_id`)
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=utf8

Combining the deadlock log, the lock types, and the table definition gives the following vague conclusions:

1. From lines 10 and 12 of the deadlock log, combined with the table's indexes

Line 10: UPDATE test_table SET money = money + 5 WHERE user_id = 5
Line 12: RECORD LOCKS space id 71816 page no 4 n bits 80 index idx_user_id of table `mall`.`test_table` trx id 95146580 lock_mode X locks rec but not gap waiting

Transaction 1's statement UPDATE test_table SET money = money + 5 WHERE user_id = 5 is waiting for a lock: the update goes through the ordinary index idx_user_id, and its request for an X record lock (Record Lock) on the index entry with user_id = 5 is blocked (waiting). No gap lock is involved (not gap). The log does not tell us which statement is holding that lock.

2. From lines 22 and 24 of the deadlock log, combined with the table's indexes

Line 22: UPDATE test_table SET money = money + 4 WHERE user_id = 4
Line 24: RECORD LOCKS space id 71816 page no 4 n bits 80 index idx_user_id of table `mall`.`test_table` trx id 95146581 lock_mode X locks rec but not gap

Transaction 2, whose current statement is UPDATE test_table SET money = money + 4 WHERE user_id = 4, is holding locks: X record locks (Record Lock) on idx_user_id entries that it acquired successfully through the ordinary index, with no gap lock involved (not gap). The log does not tell us which statements acquired them.

3. From lines 22 and 34 of the deadlock log, combined with the table's indexes

Line 22: UPDATE test_table SET money = money + 4 WHERE user_id = 4
Line 34: RECORD LOCKS space id 71816 page no 4 n bits 80 index idx_user_id of table `mall`.`test_table` trx id 95146581 lock_mode X locks rec but not gap waiting

Transaction 2's statement UPDATE test_table SET money = money + 4 WHERE user_id = 4 is waiting for a lock: the update goes through the ordinary index idx_user_id, and its request for an X record lock (Record Lock) on the index entry with user_id = 4 is blocked (waiting). No gap lock is involved (not gap). Again, the log does not tell us which statement is holding that lock.

Vague conclusions are bound to have problems. The biggest one is that the SQL statements behind the deadlock are not pinned down: the cause (a deadlock) is real, but it is unclear which SQL produced it. From this we put together the following table, which may well be inaccurate:

Transaction 1: some SQL; update with user_id = 5 (the new update operation is blocked); some SQL
Transaction 2: some SQL; update with user_id = 4 (obtained a lock, but is blocked); some SQL

As you can see, analyzing the deadlock log alone gives a one-sided picture: the two updates for user_id 4 and user_id 5 would not block each other by themselves, so other SQL must be involved. We therefore need the business logs as well to reconstruct the full scene.

4. Analysis from business logs

The deadlock log does not reveal all the key SQL or the full sequence of events at the failure site, so we use the business log to complete the analysis. From the earlier business log analysis we know that the key method is BizManagerImpl.transactionMoney. Its source code is:

@Override
@Transactional
public boolean transactionMoney(List<TransactionReqVO> transactionReqVOList) throws Exception {
    for (TransactionReqVO transactionReqVO : transactionReqVOList) {
        // Simulate the business operation
        Thread.sleep(1000);
        int updateCount = testTableService.update(transactionReqVO.getUserId(), transactionReqVO.getMoney());
        if (updateCount == 0) {
            log.error("Transfer failed: " + transactionReqVO);
        }
    }
    return true;
}

So the problem lies in a transaction that wraps a for loop, but it is still unclear which user_id values are involved. Searching the business log by the full-link traceId (simulated here) gives the following context:

[ConsumerThread2] org.example.controller.TestController    : Log for global trace id 2: [TransactionReqVO(userId=4, money=4), TransactionReqVO(userId=2, money=2), TransactionReqVO(userId=5, money=5)]
[ConsumerThread1] org.example.controller.TestController    : Log for global trace id 1: [TransactionReqVO(userId=5, money=5), TransactionReqVO(userId=1, money=1), TransactionReqVO(userId=4, money=4)]

At this point we can reconstruct the deadlock scenario. The transaction flow chart is as follows:

[Figure: transaction flow chart]

5. Combined analysis of business logs and deadlock logs

Combining the inaccurate table from the deadlock log analysis with the business log analysis gives the final, correct transaction table:

[Figure: final transaction table]

The SQL shown in the deadlock log is therefore only part of the picture, although the cause it reports is correct. The specific SQL statements that caused the deadlock have to be determined from the business log.
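The combined picture can also be checked mechanically. The sketch below is an in-memory replay, not a query against MySQL. It assumes that each UPDATE takes a row lock keyed by user_id, that locks are held until the transaction ends, and that the transactions advance one statement at a time. DeadlockReplay and its method name are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DeadlockReplay {
    /**
     * Replays the transactions round by round (one UPDATE per transaction per round)
     * and returns true if some transactions end up blocked with no way to proceed.
     */
    static boolean deadlocks(List<List<Long>> lockOrders) {
        int n = lockOrders.size();
        Map<Long, Integer> owner = new HashMap<>();       // user_id -> transaction holding its row lock
        Map<Integer, Integer> waitsFor = new HashMap<>(); // blocked transaction -> lock holder
        int[] next = new int[n];
        boolean[] committed = new boolean[n];
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            for (int tx = 0; tx < n; tx++) {
                if (committed[tx] || waitsFor.containsKey(tx)) {
                    continue;
                }
                List<Long> order = lockOrders.get(tx);
                if (next[tx] == order.size()) {
                    // Commit: release this transaction's locks and wake up its waiters.
                    final int done = tx;
                    owner.values().removeIf(holder -> holder == done);
                    waitsFor.values().removeIf(holder -> holder == done);
                    committed[tx] = true;
                    progressed = true;
                    continue;
                }
                Long userId = order.get(next[tx]);
                Integer holder = owner.get(userId);
                if (holder == null || holder == tx) {
                    owner.put(userId, tx);
                    next[tx]++;
                    progressed = true;
                } else {
                    waitsFor.put(tx, holder); // blocked until the holder commits
                }
            }
        }
        // Nothing can move any more; anyone still waiting is part of a deadlock.
        return !waitsFor.isEmpty();
    }

    public static void main(String[] args) {
        List<List<Long>> fromLogs = Arrays.asList(
                Arrays.asList(5L, 1L, 4L),  // ConsumerThread1
                Arrays.asList(4L, 2L, 5L)); // ConsumerThread2
        System.out.println(deadlocks(fromLogs));
    }
}
```

With the two lists from the business log the replay reports a deadlock: one transaction holds user_id 5 and wants 4, while the other holds 4 and wants 5. With both lists sorted by user_id it does not.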

6. Aftermath

Having reproduced the scenario of transaction 2, we can execute the rolled-back SQL by hand to repair the affected users' data (customers first). We also learn that transactionMoney() should not be transactional: in this business scenario each user's update is independent and should not affect the others. When an update fails, though, we still need to log it.
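A minimal sketch of that idea follows. It assumes each update commits on its own (no surrounding transaction), so a row lock is held only for the duration of one statement. MoneyUpdater is a hypothetical stand-in for testTableService.update, and the requests are passed as {userId, money} pairs to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

class IndependentUpdates {
    /** Hypothetical stand-in for testTableService.update(userId, money). */
    interface MoneyUpdater {
        int update(long userId, long money) throws Exception;
    }

    /**
     * Applies each {userId, money} pair on its own and returns the user ids
     * whose update threw an exception or matched no row.
     */
    static List<Long> applyEach(long[][] requests, MoneyUpdater updater) {
        List<Long> failed = new ArrayList<>();
        for (long[] request : requests) {
            long userId = request[0];
            long money = request[1];
            try {
                if (updater.update(userId, money) == 0) {
                    System.err.println("Transfer affected no row: userId=" + userId);
                    failed.add(userId);
                }
            } catch (Exception e) {
                // One failing user must not stop the remaining updates.
                System.err.println("Transfer failed: userId=" + userId + ", cause=" + e.getMessage());
                failed.add(userId);
            }
        }
        return failed;
    }
}
```

The returned list makes it easy to log or retry the failed users without touching the ones that succeeded.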

This also explains why there was no problem for more than half a year and why the exception has become frequent recently: it can only be triggered when two transactions run at the same time and share two or more user_id values. Such user_id values belong to the so-called large users. In this example, user_id 1 and 2 are small users; they are affected too, but far less often than the large users with user_id 4 and 5.

Actual production data confirms this. Not only are the failures concentrated in peak periods, the failing users are also often the same familiar faces. The later review showed that these familiar faces are exactly what we call "large users" (users with a high frequency of business operations).

7. Simulation project source code

To simulate the method calls of the real scenario (triggered by consuming messages), the project uses threads. Thread sleeps keep each transaction open long enough that every simulated run produces the exception.
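The same overlap can be illustrated without a database. The sketch below is an in-memory analogue, not the project's code: each user_id gets a ReentrantLock in place of a row lock, a latch replaces the sleep so that both workers are guaranteed to hold their first lock before continuing, and a tryLock timeout stands in for MySQL choosing a victim. All names are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

class InMemoryDeadlockDemo {
    private static final long WAIT_MS = 500;

    /** Runs two workers that lock user ids in the given orders; returns how many gave up on a lock. */
    static int run(long[] orderA, long[] orderB) {
        Map<Long, ReentrantLock> rowLocks = new HashMap<>();
        for (long id : orderA) rowLocks.putIfAbsent(id, new ReentrantLock());
        for (long id : orderB) rowLocks.putIfAbsent(id, new ReentrantLock());
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        AtomicInteger timedOut = new AtomicInteger();
        Thread a = new Thread(() -> work(orderA, rowLocks, bothHoldFirst, timedOut), "ConsumerThread1");
        Thread b = new Thread(() -> work(orderB, rowLocks, bothHoldFirst, timedOut), "ConsumerThread2");
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return timedOut.get();
    }

    private static void work(long[] order, Map<Long, ReentrantLock> rowLocks,
                             CountDownLatch bothHoldFirst, AtomicInteger timedOut) {
        Deque<ReentrantLock> held = new ArrayDeque<>();
        try {
            for (int i = 0; i < order.length; i++) {
                ReentrantLock lock = rowLocks.get(order[i]);
                if (!lock.tryLock(WAIT_MS, TimeUnit.MILLISECONDS)) {
                    timedOut.incrementAndGet(); // stand-in for being rolled back
                    return;
                }
                held.push(lock);
                if (i == 0) {
                    // Wait (bounded) until the other worker holds its first lock too.
                    bothHoldFirst.countDown();
                    bothHoldFirst.await(WAIT_MS, TimeUnit.MILLISECONDS);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            while (!held.isEmpty()) {
                held.pop().unlock(); // "commit" or "rollback": release everything
            }
        }
    }
}
```

Calling run with the logged orders (5, 1, 4) and (4, 2, 5) makes at least one worker give up, while the sorted orders (1, 4, 5) and (2, 4, 5) normally complete without any timeout.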

[Figure: project structure]

The project structure is simple: Controller -> Manager -> Service -> Mapper -> DB. After running curl 'localhost:8080/test/consumer', check the command-line output for the business exception log.

To see the corresponding deadlock log, run show engine innodb status in the corresponding database.

8. Finally

While researching, I found a project that catalogs the SQL that can lie behind each kind of deadlock log: https://github.com/aneasystone/mysql-deadlocks. It also explains the locking process in detail and is well worth a look. Here are some screenshots of the project:

[Figure: screenshots of the mysql-deadlocks project]

It is a good reference when you run into complex business scenarios, especially unfamiliar ones.

Business logging and full-link tracing are very, very important.

Origin blog.csdn.net/weixin_48612224/article/details/109191912