Ouch! How many people have been bitten by this MySQL bug?

Problem Description

Recently, after an important MySQL customer's tables were upgraded online from 5.6 to 5.7, inserts on the master began failing with a "Duplicate key" error. The error appeared on both the master and the RO instance.

Take one of the tables as an example. Before the migration, the AUTO_INCREMENT value shown by "show create table" was 1758609; after the migration it was 1758598. Yet the actual maximum value of the auto-increment column in the new table produced by the migration, as returned by max(), was 1758609.

The user uses the InnoDB engine. According to the operations engineers, they had run into similar problems before, and a restart had restored normal behavior.

Kernel troubleshooting

Users reported that access was normal on 5.6 and that errors appeared only after switching to 5.7, so the first suspect was the 5.7 kernel. Our first reaction was to search the official bug list for similar problems, to avoid reinventing the wheel. The search turned up a similar official bug; a brief introduction to it follows.

Background knowledge 1

Auto-increment parameters and data structures in the InnoDB engine

The main parameters are innodb_autoinc_lock_mode, which controls the lock mode used when obtaining auto-increment values, and auto_increment_increment and auto_increment_offset, which control the step between successive values and the starting offset of the auto-increment column.
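As a rough illustration of how auto_increment_increment and auto_increment_offset shape the generated sequence, here is a minimal sketch. The helper names round_up_to_sequence and first_values are hypothetical and are not InnoDB functions; the sketch assumes 1 <= offset <= increment and ignores column overflow.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the real innobase_next_autoinc): returns the
// smallest value of the form offset + k * increment (k >= 0) that is >= from.
std::uint64_t round_up_to_sequence(std::uint64_t from,
                                   std::uint64_t increment,
                                   std::uint64_t offset) {
    assert(increment >= 1 && offset >= 1 && offset <= increment);
    if (from <= offset) {
        return offset;
    }
    // Ceiling division gives the number of steps needed to reach `from`.
    std::uint64_t steps = (from - offset + increment - 1) / increment;
    return offset + steps * increment;
}

// First `count` values handed out for an empty table under these settings.
std::vector<std::uint64_t> first_values(std::uint64_t increment,
                                        std::uint64_t offset,
                                        int count) {
    std::vector<std::uint64_t> ids;
    std::uint64_t next = round_up_to_sequence(1, increment, offset);
    for (int i = 0; i < count; ++i) {
        ids.push_back(next);
        next += increment;
    }
    return ids;
}
```

With increment 2 and offset 1 this yields 1, 3, 5; with both set to 1 it yields the usual 1, 2, 3.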

The main structures involved are: the data dictionary structure, which holds the current auto-increment value of the whole table and the lock that protects it; the transaction structure, which holds the number of rows processed within the transaction; and the handler structure, which holds the iteration state for multi-row processing within the transaction.

There is a good article on this topic online; please refer to: (https://www.cnblogs.com/zengkefu/p/5683258.html).

Background knowledge 2

How MySQL and the InnoDB engine access and modify the auto-increment value

(1) Save and restore the auto-increment value when the data dictionary structure (dict_table_t) is swapped in and out. On swap-out, the auto-increment value is saved in a global mapping table and the in-memory dict_table_t is then evicted. On swap-in, the value is restored into the dict_table_t structure by looking it up in the global mapping table. The related functions are dict_table_add_to_cache and dict_table_remove_from_cache_low.

(2) The row_import and table truncate processes update the auto-increment value.

(3) When the handler is opened for the first time, it queries the largest value of the auto-increment column in the table and initializes autoinc in the table's dict_table_t structure to that value plus one.

(4) Insert process. The stack related to autoinc modification is as follows:

ha_innobase::write_row: In its third step, write_row calls the handler's update_auto_increment function to update the auto-increment value.
    handler::update_auto_increment: Calls the InnoDB interface to obtain an auto-increment value and adjusts it according to the current auto_increment variables; it also sets the value of the next auto-increment column to be processed by the current handler.
        ha_innobase::get_auto_increment: Gets the current auto-increment value from dict_table_t and updates the next auto-increment value in the data dictionary according to the global parameters.
            ha_innobase::dict_table_autoinc_initialize: Updates the auto-increment value; the update happens only if the specified value is greater than the current value.
        handler::set_next_insert_id: Sets the auto-increment column value of the next row to be processed in the current transaction.

(5) update_row. For an "INSERT INTO t (c1,c2) VALUES(x,y) ON DUPLICATE KEY UPDATE" statement, the auto-increment value must be advanced whether or not the row identified by the unique index column already exists. The relevant code is as follows:

    if (error == DB_SUCCESS
        && table->next_number_field
        && new_row == table->record[0]
        && thd_sql_command(m_user_thd) == SQLCOM_INSERT
        && trx->duplicates) {
        ulonglong    auto_inc;
        ……
        auto_inc = table->next_number_field->val_int();
        auto_inc = innobase_next_autoinc(auto_inc, 1, increment, offset, col_max_value);
        error = innobase_set_max_autoinc(auto_inc);
        ……
    }

Judging from the actual business workload, our errors could involve only the insert and update paths.
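Steps (1) and (3) of the list above can be pictured with a small model. This is only a sketch under simplifying assumptions: TableDict, DictCache and open_table are hypothetical names standing in for dict_table_t, the global mapping table and the first handler open, and a counter of 0 is used here to mean "not yet initialized".

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for dict_table_t; autoinc == 0 means "not initialized".
struct TableDict {
    std::string name;
    std::uint64_t autoinc;
};

// Hypothetical stand-in for the dictionary cache and its global mapping table.
class DictCache {
public:
    // Swap-out: remember the counter before the in-memory object is dropped.
    void evict(const TableDict& table) {
        if (table.autoinc != 0) {
            saved_autoinc_[table.name] = table.autoinc;
        }
    }

    // Swap-in: restore the saved counter if there is one, otherwise leave it
    // uninitialized so that the first open computes it from the data.
    TableDict load(const std::string& name) const {
        TableDict table;
        table.name = name;
        auto it = saved_autoinc_.find(name);
        table.autoinc = (it != saved_autoinc_.end()) ? it->second : 0;
        return table;
    }

private:
    std::map<std::string, std::uint64_t> saved_autoinc_;
};

// First open: an uninitialized counter becomes max(id) + 1.
void open_table(TableDict& table, std::uint64_t max_id) {
    if (table.autoinc == 0) {
        table.autoinc = max_id + 1;
    }
}
```

The point of the model is that the counter normally lives only in memory: it is either carried over through the mapping table or recomputed from max(id) when the table is first opened.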

BUG 76872 / 88321: "InnoDB AUTO_INCREMENT produces same value twice"

(1) Overview of the bug: when innodb_autoinc_lock_mode is greater than 0 and auto_increment_increment is greater than 1, concurrent multi-threaded inserts into a table immediately after a restart cause a "duplicate key" error.

(2) Cause analysis: after a restart, InnoDB sets the table's auto-increment value to max(id) + 1. On the first insert, write_row calls handler::update_auto_increment to set up the auto-increment information. It first obtains the current auto-increment value (i.e. max(id) + 1) through ha_innobase::get_auto_increment, which also advances the table's next auto-increment value to next_id according to the auto-increment parameters. When auto_increment_increment is greater than 1, max(id) + 1 is never greater than next_id. After handler::update_auto_increment receives the value from the engine layer, it recalculates the current row's value from the parameters, to guard against engines that ignore the current auto-increment parameters when computing the value. Because InnoDB does take the global parameters into account internally, the value the handler layer computes from the id returned by InnoDB is also next_id, so a row with id next_id is inserted. The handler layer sets the next auto-increment value, based on the current row's next_id, only at the end of write_row. If another thread starts an insert before write_row has set the table's next auto-increment value, that thread also obtains next_id. The result is a duplicate.

(3) Solution: have the engine take the global auto-increment parameters into account when it obtains the auto-increment value, so that the first inserting thread after a restart obtains next_id rather than max(id) + 1 and then sets the next auto-increment value based on next_id. Because this step is protected by a lock, other threads no longer obtain duplicate values.

From this analysis, the bug occurs only when innodb_autoinc_lock_mode > 0 and auto_increment_increment > 1. In the online business in question, both parameters are set to 1, so this bug can be ruled out as the cause.
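The race described in (2) and the fix in (3) can be replayed deterministically with a simplified model. This is a sketch of the reasoning above, not the actual MySQL code: every function name here is hypothetical, the two "threads" are simulated by ordering the calls by hand, and locking and overflow are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct Params { std::uint64_t increment; std::uint64_t offset; };
struct Table { std::uint64_t autoinc; };  // in-memory table counter
struct Ids { std::uint64_t a; std::uint64_t b; };

// Smallest value of the form offset + k * increment that is >= v.
std::uint64_t round_up(std::uint64_t v, Params p) {
    if (v <= p.offset) return p.offset;
    return p.offset + (v - p.offset + p.increment - 1) / p.increment * p.increment;
}

// Smallest on-sequence value strictly greater than v.
std::uint64_t next_after(std::uint64_t v, Params p) { return round_up(v + 1, p); }

// Pre-fix engine step: hands out the raw counter, stores next_id.
std::uint64_t engine_get_buggy(Table& t, Params p) {
    std::uint64_t current = t.autoinc;
    t.autoinc = std::max(t.autoinc, next_after(current, p));
    return current;
}

// Fixed engine step: aligns the counter first, then reserves past it.
std::uint64_t engine_get_fixed(Table& t, Params p) {
    std::uint64_t current = round_up(t.autoinc, p);
    t.autoinc = next_after(current, p);
    return current;
}

// Handler layer re-aligns whatever the engine returned.
std::uint64_t handler_row_id(std::uint64_t engine_value, Params p) {
    return round_up(engine_value, p);
}

// End of write_row: push the table counter past the inserted id.
void finish_write_row(Table& t, std::uint64_t row_id, Params p) {
    t.autoinc = std::max(t.autoinc, next_after(row_id, p));
}

// Thread B reserves its id before thread A reaches the end of write_row.
Ids interleaved_inserts(bool fixed, std::uint64_t max_id, Params p) {
    Table t{max_id + 1};  // state right after a restart
    auto get = fixed ? engine_get_fixed : engine_get_buggy;
    std::uint64_t a = handler_row_id(get(t, p), p);
    std::uint64_t b = handler_row_id(get(t, p), p);
    finish_write_row(t, a, p);
    finish_write_row(t, b, p);
    return {a, b};
}
```

With max(id) = 5, increment 2 and offset 1, the pre-fix path gives both threads id 7, while the fixed path gives 7 and 9. With increment 1 the pre-fix path already yields distinct ids, which matches the condition that the bug needs auto_increment_increment > 1.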

On-site analysis and reproduction

Since the official bug did not explain our problem, we had to analyze the error ourselves.

(1) Look for a pattern in max id and auto-increment. Because the user's tables have an ON UPDATE CURRENT_TIMESTAMP column, we could capture the max id, the auto-increment value, and the most recently updated records of every failing table and look for a pattern. The captured information is as follows:

At first glance the error is quite regular. The update time column holds the time of the last insert or modification. Combined with the auto-increment and max id values, it looks very much as if the last batch of transactions updated only the auto-increment id of the rows and did not update the table's auto-increment value. This recalls the description of auto-increment usage in the [official document]: an update operation can change a row's auto-increment id without advancing the auto-increment value. Following this idea, I tried to reproduce the user's scenario. The reproduction method is as follows:

We also saw operations that update the auto-increment column in the binlog, as shown:

However, because the binlog is in ROW format, we could not tell whether the change to the auto-increment column was caused by a kernel problem or by an update issued by the user. We therefore asked the customer, who was quite sure they had never updated the auto-increment column. So where did these updates to the auto-increment column come from?

(2) Analyze the user's tables and SQL statements. Further analysis showed that the user has three kinds of tables (hz_notice_stat_sharding, hz_notice_group_stat_sharding, hz_freeze_balance_sharding), all with auto-increment primary keys. The autoinc error occurred in the first two kinds but not in hz_freeze_balance_sharding. Does the user access these tables differently? Capturing the user's SQL statements confirmed it: the first two kinds of tables use REPLACE INTO, while the last uses UPDATE. Is the REPLACE INTO statement the cause? A search of the official bugs turned up another suspect.

Bug #87861: "Replace into causes master/slave have different auto_increment offset values"

Cause:

(1) MySQL actually implements REPLACE INTO as delete + insert, but with the ROW binlog format it records an update-type event in the binlog. An insert advances the auto-increment value as it runs; an update does not.

(2) On the master, REPLACE INTO runs as delete + insert, so the auto-increment value is correct. After ROW-format replication, the slave replays the change as an update: only the auto-increment key in the row is changed, and the table's auto-increment value is not advanced. As a result, max(id) is greater than the auto-increment value on the slave. In ROW mode the binlog records the values of all columns for insert operations, and the slave does not re-allocate auto-increment ids during replay, so no error is reported at this point. But if the slave is promoted to master, the first insert it handles fails with a "Duplicate key" error.

(3) The user migrated from 5.6 to 5.7 and then ran inserts directly on 5.7, which is equivalent to promoting a slave to master, hence the error.
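The chain of events in (1) to (3) can be sketched with a toy replication model. It rests on stated assumptions: Server, UpdateEvent, replace_into and apply_update are hypothetical names, each row is reduced to an auto-increment id plus one unique key, and the binlog is reduced to a single "old id, new id" update event per REPLACE INTO.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Simplified row event: the row's id changed from old_id to new_id.
struct UpdateEvent { std::uint64_t old_id; std::uint64_t new_id; };

struct Server {
    std::map<std::uint64_t, std::string> rows;  // auto-increment id -> unique key
    std::uint64_t autoinc;                      // table's next auto-increment value

    Server() : autoinc(1) {}

    // Plain insert: uses the counter; fails (returns false) on a duplicate id.
    bool insert(const std::string& key) {
        if (rows.count(autoinc) != 0) {
            return false;  // "Duplicate key"
        }
        rows[autoinc] = key;
        ++autoinc;
        return true;
    }

    // Master side: REPLACE INTO on an existing key runs as delete + insert
    // (advancing the counter) but is logged as an update event.
    UpdateEvent replace_into(const std::string& key) {
        std::uint64_t old_id = 0;
        for (auto it = rows.begin(); it != rows.end(); ++it) {
            if (it->second == key) { old_id = it->first; break; }
        }
        assert(old_id != 0);  // this sketch only covers the "key exists" case
        rows.erase(old_id);
        std::uint64_t new_id = autoinc;
        insert(key);
        return UpdateEvent{old_id, new_id};
    }

    // Slave side: replaying the update changes the row id only;
    // the counter is left untouched.
    void apply_update(const UpdateEvent& ev) {
        std::string key = rows.at(ev.old_id);
        rows.erase(ev.old_id);
        rows[ev.new_id] = key;
    }

    std::uint64_t max_id() const {
        return rows.empty() ? 0 : rows.rbegin()->first;
    }
};
```

Starting from identical servers holding ids 1 to 3, two REPLACE INTO statements on the master leave the slave with max(id) = 5 but a counter of 4, so the first insert after promotion collides with the existing id 4, while the same insert on the master succeeds.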

Solution

Possible solutions on the business side:

(1) Change the binlog format to MIXED or STATEMENT.

(2) Use INSERT ... ON DUPLICATE KEY UPDATE instead of REPLACE INTO.

Possible solutions on the kernel side:

(1) When a REPLACE INTO statement is encountered in ROW format, record a statement-format log event, writing the original statement to the binlog.

(2) In ROW format, record the REPLACE INTO statement as a delete event plus an insert event.

Lessons learned

(1) Changing the auto-increment parameters innodb_autoinc_lock_mode and auto_increment_increment can easily lead to duplicate keys. Avoid modifying them dynamically while the system is in use.

(2) When you hit an online problem, start with an on-site analysis: establish the failure scenario, the user's SQL statements, the scope of the failure and similar information. At the same time, back up the configuration, the binlog and even the instance data of the affected instance, in case they expire or are lost. Only then can you match the scenario accurately when searching the official bugs. If there is no related official bug, you can still analyze the problem independently from the clues you have.


 


Origin blog.csdn.net/bjmsb/article/details/108623450