Performance Impact Analysis of DROP TABLE in MySQL

【Author】

Wang Dong: database expert at Ctrip's Technical Support Center, focused on troubleshooting difficult database problems, with a strong interest in developing automation tools for intelligent database operation and maintenance.

【Problem Description】

Recently one of our MySQL instances experienced a brief hang: the MySQL service was unresponsive for a short time, and the moment it recovered, the number of concurrent threads and connections surged.
While investigating the error log, the following page_cleaner timeout messages caught our attention:

2019-08-24T23:47:09.361836+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 24915ms. The settings might not be optimal. (flushed=182 and evicted=0, during the time.)
2019-08-24T23:47:16.211740+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4849ms. The settings might not be optimal. (flushed=240 and evicted=0, during the time.)
2019-08-24T23:47:23.362286+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 6151ms. The settings might not be optimal. (flushed=215 and evicted=0, during the time.)

【Problem Analysis】

1. How the page_cleaner messages in the error log are generated

From the source file storage/innobase/buf/buf0flu.cc we can see that this message is printed to the error log when the condition curr_time > next_loop_time + 3000 is met (i.e., the iteration took more than 4 seconds). next_loop_time is set to curr_time + 1000 ms, that is, the page flush loop is intended to run once per second.

    if (ret_sleep == OS_SYNC_TIME_EXCEEDED) {
        ulint curr_time = ut_time_ms();

        if (curr_time > next_loop_time + 3000) {
            if (warn_count == 0) {
                ib::info() << "page_cleaner: 1000ms"
                    " intended loop took "
                    << 1000 + curr_time
                       - next_loop_time
                    << "ms. The settings might not"
                    " be optimal. (flushed="
                    << n_flushed_last
                    << " and evicted="
                    << n_evicted
                    << ", during the time.)";
                if (warn_interval > 300) {
                    warn_interval = 600;
                } else {
                    warn_interval *= 2;
                }

                warn_count = warn_interval;
            } else {
                --warn_count;
            }
        } else {
            /* reset counter */
            warn_interval = 1;
            warn_count = 0;
        }

        next_loop_time = curr_time + 1000;
        n_flushed_last = n_evicted = 0;
    }
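Note the warn_interval back-off in the branch above: when the loop keeps overrunning, the message is printed progressively less often, because warn_interval doubles after each printed warning (capped at 600 loops). A minimal Python model of just that back-off logic (a sketch, not MySQL code; simulate_warnings is a name made up here):

```python
# Model of the warn_interval back-off shown above: assume every loop
# overruns, and record the loop indices at which a warning is printed.
def simulate_warnings(n_slow_loops):
    """Return 0-based loop indices at which the warning would be printed."""
    printed = []
    warn_interval = 1   # value set by the 'reset counter' branch
    warn_count = 0
    for i in range(n_slow_loops):
        if warn_count == 0:
            printed.append(i)
            if warn_interval > 300:
                warn_interval = 600
            else:
                warn_interval *= 2
            warn_count = warn_interval
        else:
            warn_count -= 1
    return printed

print(simulate_warnings(20))  # warnings get further apart: [0, 3, 8, 17]
```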

The second half of the message, (flushed=182 and evicted=0, during the time.), corresponds to the variables n_flushed_last and n_evicted, which are accumulated from the following two variables:

n_evicted += n_flushed_lru;
n_flushed_last += n_flushed_list;

As the source comments show, n_flushed_lru is the number of pages flushed from the tail of the LRU list; this is the evicted=0 value in the log.
n_flushed_list is the number of dirty pages flushed from the flush_list; this is the flushed=182 value in the log.

/**
Wait until all flush requests are finished.
@param n_flushed_lru    number of pages flushed from the end of the LRU list.
@param n_flushed_list   number of pages flushed from the end of the
            flush_list.
@return         true if all flush_list flushing batch were success. */
static
bool
pc_wait_finished(
    ulint*  n_flushed_lru,
    ulint*  n_flushed_list)

The pc_wait_finished function shows that the page_cleaner thread performs two kinds of flushing, one over the LRU list and one over the flush_list, and must wait for both parts to finish.

2. MySQL 5.7.4 introduced the multi-threaded page cleaner, but because LRU-list flushing and flush-list (dirty page) flushing are still coupled together, performance problems remain under high load or when a buffer pool instance contains hot data:

1) LRU-list flushing runs before flush-list flushing, and the two are mutually exclusive: once flush-list flushing has started, LRU-list flushing cannot proceed until the next iteration.
2) In addition, the page cleaner coordinator waits for all page cleaner worker threads to finish before responding to new flush requests. As a result, if one buffer pool instance is hot, the page cleaner cannot respond promptly.
Percona Server has heavily optimized LRU-list flushing to address this.

3. Analysis of the problem instance's binlog shows that nothing was written between 2019-08-24 23:46:15 and 2019-08-24 23:47:25, indicating that during this window the MySQL service could not process requests normally, i.e., it briefly hung:

mysqlbinlog -vvv binlog --start-datetime='2019-08-24 23:46:15' --stop-datetime='2019-08-24 23:47:25'|less

The monitoring counters show that concurrent threads backed up and QPS dropped during this period; when the MySQL service recovered, the backlogged requests were released all at once, which caused the further spike in concurrent connections.

4. The Innodb_buffer_pool_pages_free and Innodb_buffer_pool_pages_misc counters show that during the problem window about 16 KB × (546893 − 310868) = 3,776,400 KB, roughly 3.7 GB, of buffer pool memory was freed within about one minute.
The buffer pool (LRU list) mutex was likely held during this release, blocking the page cleaner thread's LRU-list flushing, which explains why the page_cleaner loop took so long.
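The arithmetic above, spelled out (16 KB is the default InnoDB page size; the two counts are the observed Innodb_buffer_pool_pages_misc readings before and after the release):

```python
# Freed buffer-pool memory estimated from the drop in
# Innodb_buffer_pool_pages_misc (default InnoDB page size: 16 KB).
PAGE_SIZE_KB = 16

def freed_kb(misc_before, misc_after):
    """KB released when 'misc' pages (largely AHI) return to the free list."""
    return PAGE_SIZE_KB * (misc_before - misc_after)

print(freed_kb(546893, 310868))  # 3776400 KB, i.e. roughly 3.7 GB as quoted above
```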

Innodb_buffer_pool_pages_misc is related to the adaptive hash index (AHI); the official documentation describes it as follows:

• Innodb_buffer_pool_pages_misc
The number of pages in the InnoDB buffer pool that are busy because they have been allocated for administrative overhead, such as row locks or the adaptive hash index. This value can also be calculated as Innodb_buffer_pool_pages_total − Innodb_buffer_pool_pages_free − Innodb_buffer_pool_pages_data. When using compressed tables, Innodb_buffer_pool_pages_misc may report an out-of-bounds value (Bug #59550).
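The identity from the documentation can be evaluated directly. A small sketch (the total and data values below are illustrative, chosen so the result matches the misc reading above; the actual pool size of the incident instance is not given in this article):

```python
# Innodb_buffer_pool_pages_misc via the documented identity:
# misc = total - free - data.
def pages_misc(total, free, data):
    return total - free - data

# Illustrative: a 20 GB pool (1310720 x 16 KB pages) with the observed
# free count of 310868 and a hypothetical data count of 452959.
print(pages_misc(1310720, 310868, 452959))  # 546893, the observed misc value
```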

5. Why would so much AHI memory be released in such a short time? From the slow query log we found a DROP TABLE during the problem window. The dropped table was only about 50 GB, not especially large, so why would dropping it briefly hang the MySQL service? To measure how much impact DROP TABLE has on server performance, we ran a simulation test.

【Reproducing the Problem】

To further validate this, we simulated the effect of DROP TABLE on a highly concurrent MySQL instance in our test environment.
1. Use sysbench for load generation: in the test environment, create 8 tables of 200 million rows each, about 48 GB per table.
2. With innodb_adaptive_hash_index enabled, run the olap workload for 1 hour to populate the AHI memory for the eight tables.
3. Restart the load threads so that they access only the sbtest1 table, and observe MySQL's throughput.
4. In a new session, run drop table test.sbtest2; dropping the 48 GB table took 64.84 seconds.

5. Track Innodb_buffer_pool_pages_misc and Innodb_buffer_pool_pages_free second by second with a custom script: during the drop, Innodb_buffer_pool_pages_misc fell sharply while Innodb_buffer_pool_pages_free grew by almost exactly the same amount, leaving the total essentially unchanged, as shown in the figure.

6. During the DROP TABLE, MySQL was effectively hung, with QPS stuck at 0 for a long time.

7. The page_cleaner messages also reappeared in the error log.

At this point the problem was fully reproduced.
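The per-second sampling mentioned in step 5 can be sketched as follows. This is an illustration only: it assumes the counters are collected with the mysql command-line client, and parse_status / sample_loop are names invented here; the actual script used in the test is not shown in this article.

```python
# Sketch of a per-second sampler for the two buffer-pool counters.
# Parsing is kept separate from collection so it can be tested offline.
import subprocess
import time

QUERY = "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_%'"

def parse_status(text):
    """Parse tab-separated 'Variable_name<TAB>Value' lines into a dict of ints."""
    status = {}
    for line in text.strip().splitlines():
        name, value = line.split("\t")
        status[name] = int(value)
    return status

def sample_loop(seconds):
    """Print misc/free page counts once per second (needs a running server)."""
    for _ in range(seconds):
        out = subprocess.run(["mysql", "-N", "-e", QUERY],
                             capture_output=True, text=True).stdout
        s = parse_status(out)
        print(s["Innodb_buffer_pool_pages_misc"],
              s["Innodb_buffer_pool_pages_free"])
        time.sleep(1)
```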

【Why MySQL Briefly Hung】

1. During the load test we captured pstack output, show engine innodb status, and mutex-related information from performance_schema tables such as events_waits_summary_global_by_event_name.
2. The SEMAPHORES section shows a large number of S-lock requests waiting during the hang, all blocked by the same thread, 140037411514112, which held the lock for as long as 64 seconds:

--Thread 140037475792640 has waited at row0purge.cc line 862 for 64.00 seconds the semaphore:
S-lock on RW-latch at 0x966f6e38 created in file dict0dict.cc line 1183
a writer (thread id 140037411514112) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file row0purge.cc line 862
Last time write locked in file /mysql-5.7.23/storage/innobase/row/row0mysql.cc line 4253
--Thread 140037563102976 has waited at srv0srv.cc line 1982 for 57.00 seconds the semaphore:
X-lock on RW-latch at 0x966f6e38 created in file dict0dict.cc line 1183
a writer (thread id 140037411514112) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file row0purge.cc line 862
Last time write locked in file /mysql-5.7.23/storage/innobase/row/row0mysql.cc line 4253
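Picking the blocking thread out of the SEMAPHORES text by hand scales poorly when there are many waiters; a small helper like the following (a sketch; blocking_writers is a name made up here) does the same counting automatically:

```python
# Count, per exclusive-latch holder, how many waiters in the SEMAPHORES
# section of SHOW ENGINE INNODB STATUS are blocked on it.
import re
from collections import Counter

def blocking_writers(semaphores_text):
    """Map writer thread id -> number of waiters blocked by it."""
    pattern = r"a writer \(thread id (\d+)\) has reserved it in mode exclusive"
    return Counter(re.findall(pattern, semaphores_text))
```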

3. The ROW OPERATIONS section shows that the MySQL Main Thread was also blocked by the same thread 140037411514112, stuck in the state enforcing dict cache limit:

3 queries inside InnoDB, 0 queries in queue
17 read views open inside InnoDB
Process ID=39257, Main thread ID=140037563102976, state: enforcing dict cache limit
Number of rows inserted 1870023456, updated 44052478, deleted 22022445, read 9301843315
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s

4. Thread 140037411514112 was executing the statement drop table test.sbtest2, which shows that a lock held during DROP TABLE was blocking both the other threads and the Main Thread:

---TRANSACTION 44734025, ACTIVE 64 sec dropping table
10 lock struct(s), heap size 1136, 7 row lock(s), undo log entries 1
MySQL thread id 440836, OS thread handle 140037411514112, query id 475074428 localhost root checking permissions
drop table test.sbtest2

5. Below is a simplified call stack captured during the DROP TABLE. Throughout the 64 seconds, the drop was executing btr_search_drop_page_hash_index, i.e., clearing the AHI entries for the table's pages:

Thread 32 (Thread 0x7f5d002b2700 (LWP 8156)):
#0 ha_remove_all_nodes_to_page
#1 btr_search_drop_page_hash_index
#2 btr_search_drop_page_hash_when_freed
#3 fseg_free_extent
#4 fseg_free_step
#5 btr_free_but_not_root
#6 btr_free_if_exists
#7 dict_drop_index_tree
#8 row_upd_clust_step
#9 row_upd
#10 row_upd_step
#11 que_thr_step
#12 que_run_threads_low
#13 que_run_threads
#14 que_eval_sql
#15 row_drop_table_for_mysql
#16 ha_innobase::delete_table
#17 ha_delete_table
#18 mysql_rm_table_no_locks
#19 mysql_rm_table
#20 mysql_execute_command
#21 mysql_parse
#22 dispatch_command
#23 do_command
#24 handle_connection
#25 pfs_spawn_thread
#26 start_thread ()
#27 clone ()

6. From the code we can see that when DROP TABLE calls row_drop_table_for_mysql, it takes an exclusive lock on the data dictionary at row_mysql_lock_data_dictionary(trx):

row_drop_table_for_mysql(
    const char*     name,
    trx_t*          trx,
    bool            drop_db,
    bool            nonatomic,
    dict_table_t*   handler)
{
    dberr_t         err;
    dict_foreign_t* foreign;
    dict_table_t*   table = NULL;
    char*           filepath = NULL;
    char*           tablename = NULL;
    bool            locked_dictionary = false;
    pars_info_t*    info = NULL;
    mem_heap_t*     heap = NULL;
    bool            is_intrinsic_temp_table = false;

    DBUG_ENTER("row_drop_table_for_mysql");
    DBUG_PRINT("row_drop_table_for_mysql", ("table: '%s'", name));

    ut_a(name != NULL);

    /* Serialize data dictionary operations with dictionary mutex:
    no deadlocks can occur then in these operations */

    trx->op_info = "dropping table";

    if (handler != NULL && dict_table_is_intrinsic(handler)) {
        table = handler;
        is_intrinsic_temp_table = true;
    }

    if (table == NULL) {
        if (trx->dict_operation_lock_mode != RW_X_LATCH) {
            /* Prevent foreign key checks etc. while we are
            dropping the table */

            row_mysql_lock_data_dictionary(trx);

            locked_dictionary = true;
            nonatomic = true;
        }

7. Taking the Main Thread as an example: it was blocked while trying to acquire the X-lock on the RW-latch in srv_master_evict_from_table_cache:

/********************************************************************//**
Make room in the table cache by evicting an unused table.
@return number of tables evicted. */
static
ulint
srv_master_evict_from_table_cache(
/*==============================*/
    ulint   pct_check)  /*!< in: max percent to check */
{
    ulint   n_tables_evicted = 0;

    rw_lock_x_lock(dict_operation_lock);

    dict_mutex_enter_for_mysql();

    n_tables_evicted = dict_make_room_in_cache(
        innobase_get_table_cache_size(), pct_check);

    dict_mutex_exit_for_mysql();

    rw_lock_x_unlock(dict_operation_lock);

    return(n_tables_evicted);
}

8. The comment on dict_operation_lock shows that DROP TABLE must take the X lock on it, while some background threads, such as the Main Thread when checking the dict cache, also need to acquire the X-lock on dict_operation_lock, and are therefore blocked:

/** @brief the data dictionary rw-latch protecting dict_sys
table create, drop, etc. reserve this in X-mode; implicit or
backround operations purge, rollback, foreign key checks reserve this
in S-mode; we cannot trust that MySQL protects implicit or background
operations a table drop since MySQL does not know of them; therefore
we need this; NOTE: a transaction which reserves this must keep book
on the mode in trx_t::dict_operation_lock_mode */
rw_lock_t
dict_operation_lock;

9. Meanwhile, user threads that could not acquire locks were suspended: when a lock cannot be obtained immediately, row_mysql_handle_errors is called to pass the error code to the upper layer for handling:

/*********************************************************************//**
Handles user errors and lock waits detected by the database engine.
@return true if it was a lock wait and we should continue running the
query thread and in that case the thr is ALREADY in the running state. */
bool
row_mysql_handle_errors

Below is a simplified call stack of such a user thread:

Thread 29 (Thread 0x7f5d001ef700 (LWP 8159)):
#0  pthread_cond_wait@@GLIBC_2.3.2
#1  wait
#2  os_event::wait_low
#3  os_event_wait_low
#4  lock_wait_suspend_thread
#5  row_mysql_handle_errors
#6  row_search_mvcc
#7  ha_innobase::index_read
#8  handler::ha_index_read_map
#9  handler::read_range_first
#10 handler::multi_range_read_next
#11 QUICK_RANGE_SELECT::get_next
#12 rr_quick
#13 mysql_update
#14 Sql_cmd_update::try_single_table_update
#15 Sql_cmd_update::execute
#16 mysql_execute_command
#17 mysql_parse
#18 dispatch_command
#19 do_command
#20 handle_connection
#21 pfs_spawn_thread
#22 start_thread
#23 clone

10. For the page_cleaner background threads, no obvious blocking relationship was captured; only the normal call stacks below were seen:

Thread 55 (Thread 0x7f5c7fe15700 (LWP 39287)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 ()
#1 os_event::timed_wait
#2 os_event::wait_time_low
#3 os_event_wait_time_low
#4 pc_sleep_if_needed
#5 buf_flush_page_cleaner_coordinator
#6 start_thread
#7 clone
Thread 54 (Thread 0x7f5c7f614700 (LWP 39288)):
#0 pthread_cond_wait@@GLIBC_2.3.2 ()
#1 wait
#2 os_event::wait_low
#3 os_event_wait_low
#4 buf_flush_page_cleaner_worker
#5 start_thread
#6 clone

【Conclusion & Solutions】

The brief MySQL hang was caused by DROP TABLE: because the dropped table occupied a large amount of AHI memory, the drop had to perform a lengthy AHI cleanup, and while doing so it held the X lock on dict_operation_lock for a long time, blocking other background threads and user threads.
When the DROP TABLE finished and released the lock, the backlogged user threads all ran at once, producing the momentary surge in concurrent threads and connections.
To work around the problem, consider disabling the AHI (SET GLOBAL innodb_adaptive_hash_index = OFF) before dropping a large table.


Origin www.cnblogs.com/CtripDBA/p/11465315.html