Analysis of a MySQL deadlock problem

An online service reports the following exception from time to time (about 20 times a day): "Deadlock found when trying to get lock;".

      Oh my God, a deadlock! Although few errors are reported and the impact on performance seems small for now, the problem still needs to be solved: left unattended, it will become a performance bottleneck one day.
     To analyze the problem systematically, this article covers five aspects: deadlock detection; the relationship between indexes, isolation levels, and locks; the causes of deadlock; how to avoid deadlock; and locating the problem.

 Figure 1 Application log

1 How is deadlock discovered?

1.1 Deadlock causes and detection methods

     Did the two cars on the left cause a deadlock? No! Did the four cars on the right cause a deadlock? Yes!

                                                                      Figure 2 Deadlock description

      The storage engine we use for MySQL is InnoDB. The log shows that InnoDB actively detected the deadlock and rolled back one of the waiting transactions. The question is: how does InnoDB detect a deadlock?

     The intuitive method is a timeout: when two transactions wait for each other and one of them has waited longer than a set threshold, it is rolled back so that the other can continue. This method is simple and effective. In InnoDB, the parameter innodb_lock_wait_timeout sets the timeout period.

     Relying on timeouts alone is too passive. InnoDB also uses a wait-for graph algorithm to detect deadlocks actively: whenever a lock request cannot be granted immediately and has to wait, the wait-for graph algorithm is triggered.

1.2 How the wait-for graph works

     How do we know that the four cars in the picture above are deadlocked? They wait for each other's resources, and together they form a loop! Treat each vehicle as a node. When node 1 has to wait for a resource held by node 2, draw a directed edge from node 1 to node 2; the result is a directed graph. We only need to check whether this directed graph contains a loop, because a loop is a deadlock. This is the wait-for graph algorithm.
                                                                                            Figure 3 Wait-for graph

     InnoDB treats each transaction as a node, and the resources are the locks each transaction holds. When transaction 1 has to wait for a lock held by transaction 2, a directed edge is drawn from 1 to 2, and the result is a directed graph.
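
     The loop check described above can be sketched as a depth-first search over the wait-for graph. This is only an illustration of the general idea, not InnoDB's actual implementation; the function name find_cycle and the dictionary representation of the graph are assumptions made for this example.

```python
def find_cycle(wait_for):
    """Return one cycle as a list of nodes, or None if the graph has no loop.

    wait_for maps a transaction to the transactions it is waiting for
    (hypothetical representation, chosen only for this sketch).
    """
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color = {}
    path = []                      # transactions on the current DFS path

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in wait_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # nxt is already on the current path: we found a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(wait_for):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# One transaction waiting for another without a loop: no deadlock.
print(find_cycle({1: [2]}))
# Four transactions waiting in a ring: deadlock.
print(find_cycle({1: [2], 2: [3], 3: [4], 4: [1]}))
```

     Once a cycle is found, one of the transactions on it has to be rolled back so that the others can proceed.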

2 InnoDB isolation levels, indexes, and locks

      Deadlock detection is the lifeline InnoDB throws us once a deadlock has occurred. We need it, but what we need even more is the ability to avoid deadlocks in the first place. How can we avoid them as much as possible? That requires understanding how locks work in InnoDB.

2.1 The relationship between locks and indexes

       Suppose we have a message table (msg) with three fields: id is the primary key, token has a non-unique index, and message has no index.

id: bigint

token: varchar(30)

message: varchar(4096)

     InnoDB uses a clustered index for the primary key, which is a way of storing data: table data is stored together with the primary key, so the leaf nodes of the primary key index hold the row data. For ordinary (secondary) indexes, the leaf nodes hold the primary key value.

                                                                                 Figure 4 Clustered index and secondary index
     The relationship between indexes and locks is analyzed below.
1) delete from msg where id=2;

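     Since id is the primary key, it is enough to lock the matching row record directly.
                                                                               Figure 5
2) delete from msg where token='cvs';

     Since token is a secondary index, the matching secondary index entries (two rows) are locked first, and then the records for the corresponding primary keys are locked.
                                                                       Figure 6
3) delete from msg where message='What is the order number';

     message has no index, so the statement does a full table scan and filters the rows. In this case every record in the table gets an X lock.
                                                                        Figure 7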
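
     The three delete statements above can be summarized in a small toy model. It only mirrors the description in this section and is not how InnoDB is implemented; the sample rows and the function name locked_for_delete are made up for illustration.

```python
# Toy contents of the msg table: id -> (token, message).
# All values are invented for this example.
ROWS = {
    1: ("asd", "hello"),
    2: ("cvs", "hi"),
    4: ("cvs", "order"),
    7: ("zzz", "bye"),
}


def locked_for_delete(rows, column, value):
    """Return (secondary_index_entries, clustered_rows) locked in the toy model."""
    if column == "id":
        # Primary key: lock only the matching clustered index record.
        clustered = [value] if value in rows else []
        return [], clustered
    if column == "token":
        # Secondary index: lock the matching index entries first,
        # then the clustered index records they point to.
        ids = sorted(i for i, (token, _) in rows.items() if token == value)
        return [(value, i) for i in ids], ids
    # No index on the column: full table scan, every record is locked.
    return [], sorted(rows)


print(locked_for_delete(ROWS, "id", 2))
print(locked_for_delete(ROWS, "token", "cvs"))
print(locked_for_delete(ROWS, "message", "What is the order number"))
```

     The third call shows why a missing index is dangerous: the statement matches nothing, yet in this model it still locks every row.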

2.2 The relationship between locks and isolation levels

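     As taught in university database courses, databases use transaction isolation levels to keep concurrent data operations correct: 1) Read uncommitted; 2) Read committed (RC); 3) Repeatable read (RR); 4) Serializable. RC and RR are the ones we use most often.

     Read committed (RC): only committed data can be read.

     Repeatable read (RR): all queries within the same transaction see the data as it was when the transaction started. This is InnoDB's default level.

     What we discussed in the previous subsection on locks and indexes is actually locking under the RC isolation level. It prevents conflicts when different transactions modify and commit the same data, but problems can still arise when another transaction inserts data.

       As the figure below shows, transaction A gets one record from its first query but two records when it runs the same query again. From transaction A's point of view, it has seen a ghost! This is a phantom read: at the RC level, row locks alone cannot prevent phantom reads.

                                                                     Figure 8

     InnoDB's RR isolation level can prevent phantom reads. How? With locks, of course!

     To solve the phantom read problem, InnoDB introduces the gap lock.

      Transaction A executes: update msg set message='order' where token='asd';

      As at the RC level, InnoDB first puts X locks on the matching index records. In addition, it locks the gaps between the non-unique index value 'asd' and its two neighboring index entries.

       So when transaction B executes insert into msg values (null,'asd','hello'); commit;, it first checks whether this gap is locked. If it is, the insert cannot proceed immediately and must wait until the gap lock is released. This prevents the phantom read.
                                                                           Figure 9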
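
     The gap check for the insert can be sketched with a sorted list standing in for the non-unique index on token. This is a simplified model under stated assumptions, not InnoDB's real data structures: index entries are represented as (token, id) pairs, the locked range is taken to run from the entry just before the first 'asd' to the entry just after the last 'asd', and the names locked_range and insert_blocked are hypothetical.

```python
import bisect


def locked_range(entries, token):
    """Return (low, high) index entries bounding the locked range.

    entries is a sorted list of (token, id) pairs. None means the range is
    unbounded on that side. Simplified model for illustration only.
    """
    first = bisect.bisect_left(entries, (token,))
    last = first
    while last < len(entries) and entries[last][0] == token:
        last += 1
    low = entries[first - 1] if first > 0 else None
    high = entries[last] if last < len(entries) else None
    return low, high


def insert_blocked(entries, locked_token, new_entry):
    """True if new_entry would land inside the range locked for locked_token."""
    low, high = locked_range(entries, locked_token)
    return (low is None or low < new_entry) and (high is None or new_entry < high)


# Hypothetical index contents: (token, id) pairs in index order.
INDEX = [("aaa", 1), ("asd", 5), ("bbb", 9)]

print(insert_blocked(INDEX, "asd", ("asd", 10)))  # lands inside the locked gap
print(insert_blocked(INDEX, "asd", ("zzz", 11)))  # lands outside the locked range
```

     In this model an insert with token 'asd' has to wait, while an insert that sorts after ('bbb', 9) goes through, which matches the behavior described above.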

     A recommended article for a deeper understanding of how locks work: http://hedengcheng.com/?p=771#_Toc374698322

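3 Causes of deadlock

     Now that we understand the basics of InnoDB locks, let us analyze the causes of deadlock. As mentioned earlier, a deadlock generally arises when transactions wait for each other's resources and eventually form a loop. Below are some simple examples of mutual waiting that ends in a loop.

3.1 Row lock conflicts across different tables

     This case is easy to understand: transaction A and transaction B operate on the same two tables, but end up waiting for each other's locks in a cycle.

                                                                       Figure 10

3.2 Row lock conflicts within the same table

     This case is fairly common. We once had two jobs running batch updates: job A processed the id list [1,2,3,4] while job B processed the id list [8,9,10,4,2], which caused a deadlock.

                                                                          Figure 11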
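
     Whether two lock orders can deadlock each other can be checked mechanically: look for a pair of rows that both jobs touch but in opposite order. The sketch below is a hypothetical helper written for this article (the name conflicting_pairs is not from any library), and it assumes each job locks rows strictly in list order.

```python
from itertools import combinations


def conflicting_pairs(order_a, order_b):
    """Return pairs (x, y) that A locks as x then y while B locks y then x."""
    pos_b = {row: i for i, row in enumerate(order_b)}
    # Rows touched by both jobs, kept in the order in which A locks them.
    shared = [row for row in order_a if row in pos_b]
    return [(x, y) for x, y in combinations(shared, 2) if pos_b[x] > pos_b[y]]


job_a = [1, 2, 3, 4]
job_b = [8, 9, 10, 4, 2]
print(conflicting_pairs(job_a, job_b))
```

     Job A locks row 2 and then wants row 4, while job B locks row 4 and then wants row 2. If both reach the first step at the same time, each waits for the other.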

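3.3 Lock conflicts between different indexes

     This case is more subtle. When transaction A executes, it locks the clustered index in addition to the secondary index, and the order in which it locks the clustered index is [1,4,2,3,5]. Transaction B locks only the clustered index, in the order [1,2,3,4,5]. This makes a deadlock possible.

                                                                          Figure 12

3.4 Gap lock conflicts

     At the RR level, InnoDB can also deadlock in the situation below, which is again fairly subtle. Readers who are unsure why can work through it using the gap lock principle described earlier.
                                                                               Figure 13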

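4 How to avoid deadlocks as much as possible

1) Access tables and rows in a fixed order. For the two batch-update jobs in Section 3.2, a simple fix is to sort the id list before executing, which avoids the crossed lock waits. Likewise, for the case in Section 3.1, making the two transactions issue their SQL statements in the same order also avoids the deadlock.

2) Split large transactions into small ones. Large transactions are more prone to deadlock; if the business logic allows, break them up.

3) Within a single transaction, try to lock all required resources at once to reduce the chance of deadlock.

4) Lower the isolation level. If the business logic allows, lowering the isolation level is also a good option; for example, moving from RR to RC avoids many deadlocks caused by gap locks.

5) Add appropriate indexes to the table. As shown above, a statement that does not use an index locks every row of the table, which greatly increases the chance of deadlock.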
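
     The first point, sorting the id list before a batch update, can be sketched as follows. The helper names batch_update_order and same_relative_order are hypothetical and exist only for this illustration; the actual update statements are left out.

```python
def batch_update_order(ids):
    """Return the ids in ascending order, without duplicates."""
    return sorted(set(ids))


def same_relative_order(order_a, order_b):
    """True if the rows shared by both lists appear in the same order in each."""
    in_a = set(order_a)
    in_b = set(order_b)
    shared_a = [row for row in order_a if row in in_b]
    shared_b = [row for row in order_b if row in in_a]
    return shared_a == shared_b


job_a = [1, 2, 3, 4]
job_b = [8, 9, 10, 4, 2]

# Before sorting, the shared rows 2 and 4 are visited in opposite order.
print(same_relative_order(job_a, job_b))

# After sorting, both jobs visit the shared rows in the same order.
print(batch_update_order(job_b))
print(same_relative_order(batch_update_order(job_a), batch_update_order(job_b)))
```

     With both lists sorted, whichever job reaches row 2 first also reaches row 4 first, so the other job simply waits instead of forming a loop.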

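5 How to locate the cause of a deadlock

     Using the deadlock from the beginning of this article as an example, here is how to track down the cause.

1) Use the application's business log to locate the problem code and find the SQL of the corresponding transaction.

      Because a transaction is rolled back once a deadlock is detected, this shows up as an exception in the application's business log. From those logs we can locate the relevant code and work out the SQL statements in the transaction.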

start tran
1 deleteHeartCheckDOByToken
2 updateSessionUser
...
commit

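      In addition, the rollback information in the log shows that this is the transaction that was rolled back when the deadlock was detected.

2) Determine the database isolation level.

     Running select @@global.tx_isolation shows the database's isolation level (on MySQL 8.0 and later the variable is named transaction_isolation). Ours is RC, so gap locks can very likely be ruled out as the cause of the deadlock.

3) Ask the DBA to run show engine innodb status and look at the most recent deadlock log.

     This step is critical. With the DBA's help we get much more detailed deadlock information. The detailed log immediately shows that the transaction conflicting with the one above has the following structure: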

start tran
1 updateSessionUser
2 deleteHeartCheckDOByToken
...
commit

  Isn't this exactly the deadlock described in Figure 10!

When reposting, please cite the source: http://www.cnblogs.com/LBSer
