FTWRL timeout during an xtrabackup (xpb) backup

Symptom:

PXC version: 5.5.34
xtrabackup version: 2.0.6

The nightly xtrabackup backup failed. The backup log shows:
    ...
    ...
    >> log scanned up to (386464269338)
    >> log scanned up to (386464269338)
    >> log scanned up to (386464269338)
    >> log scanned up to (386464269338)
    innobackupex: Error: Connection to mysql child process (pid=14432) timedout. (Time limit of 900 seconds exceeded. You may adjust time limit by editing the value of parameter "$mysql_response_timeout" in this script.) while waiting for reply to MySQL request: 'FLUSH TABLES WITH READ LOCK;' at /usr/bin/innobackupex line 381.

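Investigation:

The backup started at 00:12. According to the timestamp on the backup log file, innobackupex gave up with the timeout at 01:50.
We suspected that heavy DML or a large transaction during this window had caused FTWRL to time out. Binlog write volume in this window was small, but the slow log shows the following statements running during it: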

# Time: 190627  2:00:21
User@Host: pull_data[pull_data] @  [10.x.x.x]
Thread_id: 6984770  Schema: as_tv_online  Last_errno: 0  Killed: 0
Query_time: 2190.463271  Lock_time: 0.000220  Rows_sent: 6791540  Rows_examined: 6791540  Rows_affected: 0  Rows_read: 6791540
Bytes_sent: 5164166545
SET timestamp=1561572021;
select `id`,`services_no`,`order_type`,`order_create_time`,`order_stage`,`order_status`,`source_doc_no`,`implement_no`,`source_doc_type`,`source_info`,`dispatch_info`,`urgency_level`,`order_level`,`service_level`,`upgrade_status`,`checkup_nature`,`checkup_result`,`visit_flag`,`complaint_level`,`present_vip_card`,`mall_wechat_no`,`mall_apply_order_no`,`mall_prolong_contractno`,`demand_desc`,`remarks`,`other_remarks`,`one_level_item`,`two_level_item`,`three_level_item`,`repair_time`,`req_service_begin_time`,`req_service_end_time`,`dispatch_time`,`add_msg_times`,`related_order`,`related_order_type`,`reminder_numbers`,`complaint_numbers`,`subprocess_flag`,`subprocess_numbers`,`net_id`,`net_full_path`,`org_full_path`,`services_id`,`end_userid`,`end_time`,`servicer_receive_flag`,`servicer_refuse_reason`,`cancel_reason`,`cancel_time`,`valid_flag`,`create_name`,`is_del`,`create_time`,`create_id`,`last_modify_time`,`last_modify_id`,`user_need`,`client`,`org_id`,`net_name`,`net_parent_name`,`net_super_name`,`org_name`,`one_level_item_name`,`two_level_item_name`,`three_level_item_name`,`end_user_name`,`settlement_name`,`app_service_type`,`cancel_reason_desc`,`base_fee_amt`,`reminder_time`,`reminder_desc`,`sync_flag`,`cancel_reason_name`,`repair_result`,`adnet_id`,`adnet_name`,`adnet_full_path`,`adnet_org_full_path`,`adorder_id`,`adorder_status`,`product_version`,`rma_no`,`last_modify_name` from t_ser_order_his where 1=1;

Time: 190627  2:04:05
User@Host: pull_data[pull_data] @  [10.x.x.x]
Thread_id: 6984771  Schema: as_tv_online  Last_errno: 0  Killed: 0
Query_time: 2408.901693  Lock_time: 0.000205  Rows_sent: 6609499  Rows_examined: 6609499  Rows_affected: 0  Rows_read: 6609499
Bytes_sent: 3631997475
SET timestamp=1561572245;
select `id`,`ser_order_id`,`service_result`,`org_code`,`handjob_time`,`emp_id`,`call_customer_time`,`change_appoint_status`,`change_appoint_time`,`one_legacy_reason`,`two_legacy_reason`,`detail_reason`,`receive_time`,`contact_express_flag`,`notice_express_time`,`pickup_pay_user`,`pickup_express_no`,`pickup_express_name`,`store_location`,`notice_user_time`,`cust_receive_time`,`send_pay_user`,`send_express_no`,`send_express_name`,`warranty_type`,`service_process_desc`,`one_level_code`,`two_level_code`,`reback_type`,`reback_reason`,`reback_fee_amt`,`fee_total_amt`,`reback_amt`,`reback_time`,`debug_flag`,`nodebug_reason`,`attachments_flag`,`dispatched_flag`,`todoor_flag`,`first_door_time`,`finish_time`,`fault_type`,`fault_type_desc`,`one_level_handle_type`,`two_level_handle_type`,`new_softversion`,`charge_flag`,`invoice_flag`,`charge_amt`,`base_fee_amt`,`msg_fee_amt`,`examine_fee_amt`,`special_approve_amt`,`enjoy_le_amt`,`contact_name`,`contact_phone`,`customer_reciews`,`remote_distance`,`remote_fee`,`appraisal_no`,`finish_order_time`,`old_ser_order_id`,`old_ser_order_no`,`old_ser_order_netid`,`old_ser_order_netcode`,`old_ser_order_empid`,`old_ser_order_empname`,`repeat_ser_order_id`,`repeat_ser_order_no`,`repeat_ser_order_netid`,`repeat_ser_order_netcode`,`repeat_ser_order_empid`,`repeat_ser_order_empname`,`le_pylons_flag`,`main_wall_flag`,`prolong_materiel_no`,`bak_address`,`second_unit_name`,`order_flag`,`collect_mac_type`,`net_id`,`net_full_path`,`is_del`,`create_time`,`create_id`,`last_modify_time`,`last_modify_id`,`charge_details`,`duty_judgment`,`one_level_handle_type_name`,`two_level_handle_type_name`,`one_level_code_name`,`two_level_code_name`,`one_legacy_reason_name`,`two_legacy_reason_name`,`receive_name`,`appoint_time`,`receive_detail`,`damage_reason`,`fault_type_name`,`fault_type_desc_name`,`emp_name`,`emp_mobile_phone`,`reback_reason_name`,`duty_judgment_name`,`send_type`,`ser_product_id`,`flag_rights`,`rights_order_id`,`rights_no`,`rights_name`,`rights_type`,`operate_date`,`rights_time`,`money_used`,`rights_amount`,`repair_result`,`screen_broken`,`is_to_door`,`delivery_user_time`,`return_warehouse_time`,`recv_new_tv_time`,`identi_create_time`,`fault_phenomena`,`is_fixed_screen`,`customer_provides_materials`,`current_reason`,`fixing_position` from t_ser_order_handle_his where 1=1;

User@Host: backup[backup] @  [127.0.0.1]
Thread_id: 6985104  Schema: mysql  Last_errno: 0  Killed: 0
Query_time: 1725.026800  Lock_time: 0.000000  Rows_sent: 0  Rows_examined: 0  Rows_affected: 0  Rows_read: 0
Bytes_sent: 11
use mysql;
SET timestamp=1561572245;   -------> converts to 02:04:05
FLUSH TABLES WITH READ LOCK;

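Monitoring shows that the number of active DB threads began to climb steeply at 01:15, climbed steeply again at 01:35, levelled off after 01:36, and dropped back in two steps at 02:00 and 02:04.

Identifying the problem:

The slow log pinpoints the cause.
In the slow log, the first two queries ran for 2190.463271 s and 2408.901693 s, and the flush ran for 1725.026800 s.
Both slow queries started at around 01:20 (subtracting each Query_time from its log timestamp gives about 01:24).
FTWRL start and end times: it ended at 02:04:05 (SET timestamp=1561572245, converted) and started at around 01:35 (counting back the execution time of 1725.026800 s, about 28 minutes 45 seconds).
Backup information: the backup log actually ends (backup unsuccessful) at 01:50.
From these time windows we can tentatively conclude that the slow queries ran so long that FTWRL exceeded its 900 s limit and failed.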
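The timeline arithmetic above can be reproduced with a short script. This is only a sketch: it assumes the server clock is UTC+8 (which is what makes 1561572245 read as 02:04:05) and, like the analysis above, treats each logged timestamp as the moment the statement finished. The names `ENTRIES`, `window`, and `hms` are illustrative and not part of any tool.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the server clock is UTC+8 (China Standard Time); this is what
# makes 1561572245 read as 02:04:05, as annotated in the slow log.
CST = timezone(timedelta(hours=8))

# (label, SET timestamp value, Query_time in seconds), copied from the slow log.
ENTRIES = [
    ("select ... t_ser_order_his", 1561572021, 2190.463271),
    ("select ... t_ser_order_handle_his", 1561572245, 2408.901693),
    ("FLUSH TABLES WITH READ LOCK", 1561572245, 1725.026800),
]

# The limit quoted in the innobackupex error message.
FTWRL_CLIENT_TIMEOUT = 900


def window(end_epoch, query_time):
    """Return (start, end), treating the logged timestamp as the end time."""
    end = datetime.fromtimestamp(end_epoch, tz=CST)
    return end - timedelta(seconds=query_time), end


def hms(moment):
    """Format a datetime as HH:MM:SS (fractions of a second are dropped)."""
    return moment.strftime("%H:%M:%S")


for label, end_epoch, query_time in ENTRIES:
    start, end = window(end_epoch, query_time)
    print(f"{label}: {hms(start)} -> {hms(end)}")

# When would a client that waits at most 900 s for FTWRL give up?
ftwrl_start, _ = window(1561572245, 1725.026800)
print("client timeout at", hms(ftwrl_start + timedelta(seconds=FTWRL_CLIENT_TIMEOUT)))
```

With these inputs the FTWRL window comes out as roughly 01:35:20 to 02:04:05, and the 900 s limit falls at about 01:50:20, which lines up with the end of the backup log.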

Process analysis:

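00:12:02: xtrabackup starts the backup and first copies the InnoDB .ibd files.
About 01:20:00: the slow SQL statements start running.
About 01:35:00: xtrabackup finishes copying the InnoDB .ibd files and issues FTWRL (this step has to take the global read lock, close open tables, and set the flag that blocks commits). The global lock is acquired successfully, but closing open tables hangs because the slow SQL has not finished. (From 01:35 on, all writes are blocked, the application keeps opening new connections, and threads running climbs, which matches the thread surge seen in monitoring from 01:35. Once the slow SQL finishes at 02:00 and 02:04, the thread count returns to normal.)
About 01:50: FTWRL hits the 900 s timeout and the backup ends (matching the end time of the backup log).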

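Summary:

An overly long slow SQL statement kept FTWRL from closing open tables, so the backup exited abnormally.
FLUSH TABLES WITH READ LOCK failed, so why did threads running not drop at the moment the flush failed?
    FLUSH TABLES WITH READ LOCK was issued by xpb. When it times out, xpb exits, but the statement itself keeps waiting on the server until the slow SQL finishes. Only then does its session end and the entry get written to the slow log.
Why does the slow log record FTWRL only after the slow queries finish, rather than right after the timeout?
    Same explanation: after xpb issues FTWRL and times out, xpb exits, but the FTWRL statement it issued keeps waiting at the session level until the slow SQL finishes, and only then does the session exit.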
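The two answers above come down to the client and the server running on separate clocks: xpb stops waiting after 900 s, while the server-side statement ends only once the blocking queries do. The toy model below illustrates that with the times from this incident. It is a deliberate simplification and not MySQL's actual locking logic; `simulate_ftwrl` and the other names are hypothetical, and the FTWRL start is rounded to 01:35:20.

```python
from typing import NamedTuple, Optional, Sequence


class FtwrlOutcome(NamedTuple):
    client_gave_up_at: Optional[int]  # None if the client never timed out
    server_finished_at: int


def to_seconds(hms: str) -> int:
    """Convert 'HH:MM:SS' to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(seconds: int) -> str:
    """Convert seconds since midnight back to 'HH:MM:SS'."""
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"


def simulate_ftwrl(start: int, blocker_ends: Sequence[int], client_timeout: int) -> FtwrlOutcome:
    # Toy rule: the statement finishes only once every blocking query has ended.
    server_finished_at = max([start, *blocker_ends])
    waited = server_finished_at - start
    # The client stops waiting after client_timeout; the server-side statement does not.
    client_gave_up_at = start + client_timeout if waited > client_timeout else None
    return FtwrlOutcome(client_gave_up_at, server_finished_at)


outcome = simulate_ftwrl(
    start=to_seconds("01:35:20"),
    blocker_ends=[to_seconds("02:00:21"), to_seconds("02:04:05")],
    client_timeout=900,
)
print("xpb gives up at:", to_hms(outcome.client_gave_up_at))
print("FTWRL ends on the server at:", to_hms(outcome.server_finished_at))
```

In this model the gap between the two printed times is the period in which the backup has already failed but writes are still blocked, which is why threads running stayed high until the slow SQL finished.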

Origin www.cnblogs.com/DBA-3306/p/11097149.html