MySQL HA automatic failover scheme (semi-sync replication + keepalived + third-party data for the switching decision)

1. Introduction

Ever since I got into Internet operations work, high availability has been on my mind all the time. For nginx, tomcat, caches, queues, and databases, the most basic requirement is that every link is highly available: no single point of failure, and automatic failover. There is plenty of talk about MySQL high-availability solutions, but when you actually want to run one in production and look through a large number of them, each turns out to have some shortcoming. I used MHA for a while: its implementation is relatively complex (probably because I have never worked with Perl), the author no longer updates it, and I always worried about false switches and split brain. As for the fancier PXC and MGR, the former has significant defects, and the latter is still relatively new and lacks large-scale production experience.

My goal: a small company has only 1-2 DBAs, nobody can watch the system all the time, and sometimes the DBA is away without Internet access and cannot deal with a problem for 1-2 hours. If the master goes down in that window, the system must be able to switch automatically; otherwise the company's business stops.

My requirements: simple, practical, and reliable; do not switch master and slave lightly; avoid split brain as far as possible; do not lose business data.

A simple solution commonly found online is MySQL dual-master + keepalived. At first glance it looks perfect: both MySQL instances are writable and keepalived switches back and forth, as comfortable as managing a stateless service. In fact this approach has an obvious drawback: if a switch happens while master-slave replication is lagging, the probability of corrupted data is too high (many systems would rather be down for a few minutes than produce a large amount of dirty data).

2. Solution Overview

My solution improves on the MySQL dual-master + keepalived scheme: MySQL master-slave semi-synchronous replication + keepalived + a program and third-party data that assist the switching decision.

Main features:

1. The slave is read-only until it is promoted to master.

2. keepalived is not configured with virtual_ipaddress; VIP drift is implemented by the notify scripts.

3. A master-slave switch is a "heavyweight" operation and is not triggered lightly:

keepalived switches only after monitoring has failed continuously for 2 minutes (for example, if mysqld goes down and is restarted by mysqld_safe, no switch happens);

The switching script contains a lot of decision logic, to ensure that human error does not cause a switch and that a slave that does not meet the conditions is never promoted.

4. Switching is one-way; after a switch, the master-slave environment has to be rebuilt.

5. When it comes to MHA, many people worry about split brain. keepalived's network environment is simpler than MHA's: make sure the two master-slave nodes are on the same layer-2 network. Even if keepalived split brain occurs (that is, both nodes hold the VIP), ARP broadcast ensures that only one node serves the VIP externally, so the probability that split brain leads to dirty data is close to zero.

Project Name: mkf (mysql keepalived failover)

Script Address: https://github.com/meishd/mkf

3. Solution Details

3.1. Heartbeat table

Every database has a heartbeat table dbadmin.heartbeat (id, create_time), and a job updates create_time every second. In our environment the table is used to monitor slave replication lag; here it is also used to decide whether the slave meets the conditions for promotion to master.

3.2. dbfailover and the Python data extraction program

dbfailover runs on a separate instance and has two tables: the master database configuration table master_info and the switch arbitration information table master_arbit_info.

master_info fields:

  • db_name: primary key, the name that uniquely identifies a master-slave database pair
  • ip: master IP, to be changed after a switch
  • port: port
  • user_name: user name
  • status: 0: enabled, 1: disabled
  • update_time: DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP; this field determines whether the connection pool is reset

master_arbit_info fields:

  • id: auto_increment
  • db_name: database name
  • semi_status: semi-synchronous state, 1: semi-synchronous, 0: asynchronous
  • heartbeat_time: the time recorded in the heartbeat table
  • create_time: current time

The Python program reads master_info and initializes a connection pool for each master, checks the connection pools periodically (every 10 seconds) and reconnects on failure, and extracts information into master_arbit_info periodically (every second).
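As a rough illustration of how update_time could drive the pool reset, the sketch below compares the update_time seen at the previous poll with the rows just read from master_info. The function name pools_to_reset, the dict-shaped rows, and the sample database names are assumptions made for this example; they are not taken from mkf.py.

```python
from datetime import datetime


def pools_to_reset(known, rows):
    """Return the db_names whose connection pool should be (re)built.

    known: dict mapping db_name -> update_time seen at the previous poll
    rows:  iterable of dicts with db_name, status and update_time
           (hypothetical shape of a master_info query result)
    """
    changed = []
    for row in rows:
        if row["status"] != 0:  # 0 = enabled, 1 = disabled
            continue
        # A new db_name or a changed update_time both require a fresh pool.
        if known.get(row["db_name"]) != row["update_time"]:
            changed.append(row["db_name"])
    return changed


# Hypothetical sample data: "orders" had its IP updated, "users" is new,
# "legacy" is disabled and therefore ignored.
known = {"orders": datetime(2020, 3, 18, 14, 0, 0)}
rows = [
    {"db_name": "orders", "status": 0, "update_time": datetime(2020, 3, 18, 14, 42, 0)},
    {"db_name": "users", "status": 0, "update_time": datetime(2020, 3, 18, 9, 0, 0)},
    {"db_name": "legacy", "status": 1, "update_time": datetime(2020, 3, 18, 9, 0, 0)},
]
print(pools_to_reset(known, rows))
```

Because update_time is maintained by ON UPDATE CURRENT_TIMESTAMP, changing the ip column is enough to make the row show up as changed at the next poll.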

3.2. Master node control

keepalived switches only after the master has been down continuously for 120 seconds (interval 10, fall 13). An automatic restart of mysqld does not trigger a switch, which avoids switching lightly;
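The relationship between interval, fall, and the 120 seconds can be checked with a little arithmetic. The sketch below assumes that each check script returns almost immediately and that keepalived declares the service down on the fall-th consecutive failed check; the function name detection_window is made up for this example.

```python
def detection_window(interval_s, fall):
    """Return (shortest, longest) time in seconds between the master
    going down and the fall-th consecutive failed check.

    Shortest: the master dies just before a check runs, so the first
    failure is immediate and fall - 1 more intervals follow.
    Longest: the master dies just after a successful check, so a full
    interval passes before the first failure.
    """
    shortest = (fall - 1) * interval_s
    longest = fall * interval_s
    return shortest, longest


print(detection_window(10, 13))  # the settings used in this article
print(detection_window(10, 5))   # a window that is too short
```

With interval 10 and fall 13 the master has been unreachable for at least 120 seconds when the switch starts. With fall 5 the window shrinks to 40-50 seconds, which is well under 120.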

The master node deletes the VIP through notify_backup. This reduces the risk of operations work and avoids affecting normal use of the master:

  1. Going live is seamless: the initial configuration does not affect existing access through the VIP;
  2. Only the master node has notify_backup, which keeps switching one-way; nothing extra is done when the node enters the master state;
  3. When keepalived on the master node has just started, it briefly enters the backup state and then the master state; the script uses the keepalived start time to decide whether to exit without deleting the VIP;
  4. The correct startup order is keepalived on the master first, then keepalived on the slave. If the slave is started first by mistake, keepalived on the master enters the backup state; the script then checks the MySQL port and exits without deleting the VIP.
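The two guards in notify_backup can be summarized as a small decision function. This is a minimal sketch, not the script from the repository: the function name and parameters are hypothetical, and the 60-second threshold is the one listed in the test cases of this article.

```python
def should_delete_vip(keepalived_uptime_s, mysql_port_open, min_uptime_s=60):
    """Decide whether notify_backup should remove the VIP from the master.

    keepalived_uptime_s: seconds since keepalived was started on this node
    mysql_port_open:     True if the local MySQL port still accepts connections
    """
    if keepalived_uptime_s <= min_uptime_s:
        # keepalived has just started: this is the brief backup state
        # it passes through before becoming master.
        return False
    if mysql_port_open:
        # MySQL is still serving: wrong start order or a misconfiguration,
        # not a real failure, so the VIP stays where it is.
        return False
    return True


print(should_delete_vip(5, False))    # fresh start, keep the VIP
print(should_delete_vip(300, True))   # MySQL is up, keep the VIP
print(should_delete_vip(300, False))  # real failure, delete the VIP
```

Both guards fail safe: when in doubt the VIP stays on the master, which matches the goal of never disturbing a healthy master.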

3.3. Slave node control

The slave is promoted to master through notify_master. The aim is to ensure, as far as possible, that no transactions are lost when the slave is promoted; if that cannot be ensured, it is better not to switch.

The following conditions must all be met, in order; otherwise the operation is terminated:

  1. Port 3306 on the master server is unreachable
  2. master_arbit_info has no data within the last 120 seconds
  3. master_arbit_info has data within the last 180 seconds
  4. The last semi_status in master_arbit_info is 1
  5. The last heartbeat in master_arbit_info was updated normally
  6. The slave has replayed all relay logs; if not, check again every 3 seconds, with a timeout of 10 minutes
  7. 0 <= (heartbeat on the slave - last heartbeat in master_arbit_info) <= 1 second
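The ordered checks above can be sketched as a function that reports the first condition that fails. The PromotionInputs class and its field names are assumptions made for this example; the real notify_master.sh gathers these values with shell and SQL, and the 3-second polling of the relay log is reduced here to a single boolean.

```python
from dataclasses import dataclass, replace


@dataclass
class PromotionInputs:
    master_port_open: bool    # condition 1 requires False
    rows_last_120s: int       # condition 2 requires 0
    rows_last_180s: int       # condition 3 requires > 0
    last_semi_status: int     # condition 4 requires 1
    last_heartbeat_ok: bool   # condition 5
    relay_log_replayed: bool  # condition 6, after waiting up to 10 minutes
    heartbeat_gap_s: int      # condition 7: slave heartbeat minus last arbit heartbeat


def first_failed_check(s):
    """Return the number (1-7) of the first failed condition, or 0 if all pass."""
    checks = [
        not s.master_port_open,
        s.rows_last_120s == 0,
        s.rows_last_180s > 0,
        s.last_semi_status == 1,
        s.last_heartbeat_ok,
        s.relay_log_replayed,
        0 <= s.heartbeat_gap_s <= 1,
    ]
    for number, passed in enumerate(checks, start=1):
        if not passed:
            return number
    return 0


# Values as reported in the notify_master.log sample in this article.
log_case = PromotionInputs(False, 0, 52, 1, True, True, 1)
print(first_failed_check(log_case))
# Same situation, but semi-sync had degraded to asynchronous before the crash.
print(first_failed_check(replace(log_case, last_semi_status=0)))
```

Conditions 2 and 3 together establish that the extraction program was still delivering data shortly before the failure, so the last semi_status and heartbeat describe the master's final state rather than something stale.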

Once the above conditions are met, the slave is formally promoted:

  1. Record the slave state with show slave status\G, used after the old master is recovered to check for lost transactions
  2. set read_only=0
  3. set event_scheduler=1
  4. ip addr add ${VIP}/24 dev ${DEV}
  5. arping -I ${DEV} -c 1 ${VIP}
  6. arping -I ${DEV} -c 1 -s ${VIP} ${GATEWAY}
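Steps 4 to 6 can be written out for concrete values as follows. This sketch only builds and prints the command lines; running them requires root privileges on the slave. The function name is hypothetical, and the sample VIP, device, and gateway are the ones visible in the log output in this article.

```python
def vip_takeover_commands(vip, dev, gateway, prefix=24):
    """Build the three commands that move the VIP to this node."""
    return [
        # Step 4: bind the VIP to the local interface.
        ["ip", "addr", "add", f"{vip}/{prefix}", "dev", dev],
        # Step 5: ARP probe for the VIP itself on the local segment.
        ["arping", "-I", dev, "-c", "1", vip],
        # Step 6: ARP request to the gateway with the VIP as source address,
        # so the gateway learns the new MAC address for the VIP.
        ["arping", "-I", dev, "-c", "1", "-s", vip, gateway],
    ]


for cmd in vip_takeover_commands("10.40.12.104", "eth1", "10.40.12.254"):
    print(" ".join(cmd))
```

Passing each list to subprocess.run(cmd, check=True) would execute the commands in order and stop at the first failure.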

Log after a normal promotion of the slave to master:

# tail -20 notify_master.log  

20200318 14:42:48 notify master begin...
20200318 14:42:48 1. master is offline
20200318 14:42:48 2. master_arbit_info records within 120 seconds: 0
20200318 14:42:48 3. master_arbit_info records within 180 seconds: 52
20200318 14:42:48 4. master_arbit_info last semi status: 1
20200318 14:42:48 5. master_arbit_info last heartbeat_time after create_time: 0
20200318 14:42:48 6. slave exec log lag behind read log: 0
20200318 14:42:48 7. heartbeat lag between master and slave: 1
20200318 14:42:49 switch to master
Warning: Using a password on the command line interface can be insecure.
20200318 14:42:49 add vip
ARPING 10.40.12.104 from 10.40.12.104 eth1
Sent 1 probes (1 broadcast(s))
Received 0 response(s)
ARPING 10.40.12.254 from 10.40.12.104 eth1
Unicast reply from 10.40.12.254 [84:D9:31:9F:29:75]  1.416ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)
20200318 14:42:51 notify master end

# more slave_status.log.20200318_144248 
Warning: Using a password on the command line interface can be insecure.
*************************** 1. row ***************************
               Slave_IO_State: Reconnecting after a failed master event read
                  Master_Host: 10.40.12.101
                  Master_User: lbadmin
                  Master_Port: 3366
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000004
          Read_Master_Log_Pos: 74770
               Relay_Log_File: relay-bin.000005
                Relay_Log_Pos: 74980
        Relay_Master_Log_File: mysql-bin.000004
             Slave_IO_Running: Connecting
            Slave_SQL_Running: Yes
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 74770
              Relay_Log_Space: 137461
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 2003
                Last_IO_Error: error reconnecting to master 'lbadmin@10.40.12.101:3306' - retry-time: 60  retries: 3
               Last_SQL_Errno: 0

3.4. Test cases

Each decision condition is listed together with the way its abnormal case was simulated.

Master, notify_backup.sh:

  1. Condition: keepalived has been running for more than 60 seconds. Simulation: every fresh start of keepalived briefly enters the backup state before entering the master state, which triggers the abnormal case.
  2. Condition: the MySQL port is unreachable. Simulation: in keepalived.conf on the master node, mistakenly write notify_backup as notify_master; the abnormal case is triggered after startup.

Slave, notify_master.sh:

  1. Condition: the MySQL port on the master server is unreachable. Simulation: start keepalived on the slave node first, so the slave enters the master state and triggers the abnormal case.
  2. Condition: master_arbit_info has no data within 120 seconds. Simulation: set the keepalived script monitoring time to less than 120 seconds, for example interval 10, fall 5.
  3. Condition: master_arbit_info has data within 180 seconds. Simulation: let the Python data extraction program hang.
  4. Condition: the last semi_status in master_arbit_info is 1. Simulation: run DML on a large amount of data, or stop the slave and start it again after a while; make sure semi-sync has been degraded for more than 2 minutes, then shut down the master.
  5. Condition: the last heartbeat in master_arbit_info was updated normally. Simulation: disable the heartbeat job on the master.
  6. Condition: the slave has replayed all logs, otherwise wait in a loop for up to 10 minutes. Simulation: run DDL on a large table on the master and shut down the master once the DDL completes.
  7. Condition: 0 <= (heartbeat on the slave - last heartbeat in master_arbit_info) <= 1. Simulation: this cannot be reproduced with a real scenario; after shutting down the master, delete the latest 2 rows from master_arbit_info.


4. Operations and maintenance

4.1. Python deployment

  1. On a third-party MySQL instance, create the database dbfailover and run the script install_dbfailover.sql; the initialization information for every master to be managed must be entered in master_info
  2. Run install_dbtarget.sql on each managed master
  3. Deploy the Python data extraction program (it may run on the same server as dbfailover): install a python3 environment with the requirements PyMySQL==0.9.3, DBUtils==1.3, APScheduler==3.6.3, and set the dbmanager connection pool information in mkf.py: managerdb_pool = PooledDB(host='dbmanater_ip', user='user1', passwd='password1')
  4. Start the script: python mkf.py

4.2. Master-slave environment deployment

Install and deploy keepalived on the master and slave servers.
Master node configuration file: keepalived_master.conf; script file: notify_backup.sh
Slave node configuration file: keepalived_slave.conf; script file: notify_master.sh
In keepalived_master/slave.conf, modify the IP information; in notify_backup/master.sh, modify the variables at the top of the file.

4.3. Keepalived start and stop order

Start keepalived on the master node first, then on the slave node;

If the slave is started before the master, online business is not affected, but keepalived on the slave node must be restarted to make sure the master node enters the master state.

4.4. Operation after failover

Stop keepalived on both nodes;

If the old master server has been repaired, check its Executed_Gtid_Set and compare it with the slave status that the slave node recorded at switching time, to confirm whether any transactions were lost;

If transactions were lost, parse the binlog and work with the developers to handle the problem data; build a new master-slave pair and adjust the keepalived configuration as needed;

Update the IP in master_info; the Python program automatically updates its connection pool configuration within 10 seconds.
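The GTID comparison in the second step amounts to a set subtraction: whatever the old master executed that the promoted slave had not executed at switching time is a lost transaction. The following is a minimal sketch of that subtraction on the text form of a GTID set. The function names and the shortened server UUIDs are made up for the example, and the ranges are expanded into Python sets, which is only reasonable for small ranges.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:8,uuid2:3' into {uuid: set of transaction numbers}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        numbers = result.setdefault(uuid.lower(), set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            # A single number such as '8' has no end; treat it as 8-8.
            numbers.update(range(int(start), int(end or start) + 1))
    return result


def missing_on_new_master(old_master_set, new_master_set):
    """Return {uuid: sorted transaction numbers} executed only on the old master."""
    old = parse_gtid_set(old_master_set)
    new = parse_gtid_set(new_master_set)
    missing = {}
    for uuid, numbers in old.items():
        diff = sorted(numbers - new.get(uuid, set()))
        if diff:
            missing[uuid] = diff
    return missing


# Hypothetical values: the old master executed up to 105, the slave only up to 100.
old_master = "aaaa-1111:1-105"
promoted_slave = "aaaa-1111:1-100"
print(missing_on_new_master(old_master, promoted_slave))
print(missing_on_new_master(promoted_slave, promoted_slave))
```

On a running server, MySQL's built-in GTID_SUBTRACT() function performs the same subtraction without expanding the ranges.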

 

==== Finished. This will soon be deployed in a production environment; criticism is welcome ====

Origin blog.csdn.net/sdmei/article/details/104927564