AlwaysOn: How to Quickly Recover the Database on a Surviving Node After a Windows Cluster Failure (Extreme-Case Test)

AlwaysOn combines high availability and disaster recovery in a single technology. It supports failing over one or more databases as a group, and it provides a degree of load balancing that reduces the pressure on the primary server, which makes it a good choice in most cases. But consider an extreme situation: most of the cluster nodes go down, including the server that hosts the primary database. In other words, the Windows cluster itself has failed. How can one of the few surviving nodes quickly take over and bring the database back into service?

1: Test Purpose

When too many node servers in a Windows Failover Cluster fail, the whole cluster fails. The databases on the remaining nodes then go into the Recovery Pending state and cannot be used. The test below picks one of the surviving nodes and quickly brings its database back to an available state.

2: Test Environment

Node   Host name        IP address        Role
Node1  ALWAYSONTEST01   172.XXX.XXX.112   Primary; Synchronous Commit
Node2  ALWAYSONTEST02   172.XXX.XXX.113   Secondary; Synchronous Commit
Node3  ALWAYSONTEST03   172.XXX.XXX.114   Secondary; Asynchronous Commit

Cluster IP:  172.XXX.XXX.115
Listener IP: 172.XXX.XXX.117

Log in to the primary node; the state at this point is as follows:

Each node is operating normally.
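The same baseline can also be read with T-SQL. The query below is a minimal sketch, assuming it is run on the primary replica (the state DMV only returns rows for all replicas there) and using only the standard AlwaysOn catalog views and DMVs:

```sql
-- Replica configuration and current role/health for every replica in each AG.
SELECT ag.name                        AS availability_group,
       ar.replica_server_name,
       ar.availability_mode_desc,     -- SYNCHRONOUS_COMMIT or ASYNCHRONOUS_COMMIT
       ars.role_desc,                 -- PRIMARY, SECONDARY or RESOLVING
       ars.connected_state_desc,
       ars.synchronization_health_desc
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
  ON ar.group_id = ag.group_id
LEFT JOIN sys.dm_hadr_availability_replica_states AS ars
  ON ars.replica_id = ar.replica_id
ORDER BY ag.name, ar.replica_server_name;
```

In the environment described above, one would expect three rows: one PRIMARY and two SECONDARY replicas, with the third in asynchronous-commit mode.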

3: Test Procedure

Step 1: Shut down two nodes (XXX.112 and XXX.113) so that the Windows cluster fails. Pinging the Cluster IP now times out.

         ---- The remaining node, 172.XXX.XXX.114, holds the non-synchronized (asynchronous-commit) replica.

Step 2: Log in to the only surviving node, 172.XXX.XXX.114. SQL Server shows the following error:

 

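Step 3: Refresh the database list and check the availability group and the database. They are now in the Resolving and Recovery Pending states respectively, and the database is unavailable.

 

The Listener IP is also unavailable at this point.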
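The states seen in Step 3 can also be checked from a query window on the surviving node. This is a sketch rather than the author's method; it assumes the DMVs are still readable while the cluster service is down, which may not hold in every failure scenario:

```sql
-- Role of the local replica: RESOLVING is expected while the cluster is down.
SELECT ag.name AS availability_group,
       ars.role_desc,
       ars.operational_state_desc,
       ars.connected_state_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_groups AS ag
  ON ag.group_id = ars.group_id
WHERE ars.is_local = 1;

-- State of every database that belongs to an availability group:
-- RECOVERY_PENDING is expected here.
SELECT d.name, d.state_desc
FROM sys.databases AS d
WHERE d.replica_id IS NOT NULL;
```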

Step 4: Look up the service name of the Cluster service

(Server Manager → Local Server → Services)

or (Server Manager → Tools → Component Services → Services)

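Step 5: Stop the cluster service manually

---- net.exe stop Cluster_Name (this is actually the service name found in Step 4)

Once the service has stopped, 172.XXX.XXX.115 can no longer be pinged.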

 

 

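Step 6: Start the WSFC cluster on the single node with forced quorum

---- net.exe start Cluster_Name /forcequorum

Once the service has started, the Cluster IP can be pinged again; the Listener IP still cannot.

Use Failover Cluster Manager to check the state of the nodes and the AG:

The figure below shows the state of each node;

The figure below shows the state of the availability group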

 

Step 7: Restart the SQL Server service

---- (In some cases: first Disable and restart, then Enable and restart again.)
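Besides Failover Cluster Manager, SQL Server exposes its own view of the cluster. The sketch below assumes the SQL Server service is running again after the restart and uses the standard `sys.dm_hadr_cluster` and `sys.dm_hadr_cluster_members` DMVs:

```sql
-- Quorum as seen by SQL Server. After a forced start, quorum_state_desc
-- would be expected to report FORCED_QUORUM instead of NORMAL_QUORUM.
SELECT cluster_name,
       quorum_type_desc,
       quorum_state_desc
FROM sys.dm_hadr_cluster;

-- Per-member state: the two nodes that were shut down should show as DOWN.
SELECT member_name,
       member_type_desc,
       member_state_desc,
       number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;
```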

Step 8: Perform a forced manual failover of the availability group

  ---- ALTER AVAILABILITY GROUP group_name FORCE_FAILOVER_ALLOW_DATA_LOSS (where group_name is the name of the availability group)

 

Step 9: The availability group is now in the Primary state, the database shows as synchronized, and the Listener IP is available again.
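The result of the forced failover can be confirmed with a query as well as in the GUI. A minimal sketch, assuming it is run on the surviving node (172.XXX.XXX.114) after Step 8:

```sql
-- The local replica should now report PRIMARY instead of RESOLVING.
SELECT ag.name AS availability_group,
       ars.role_desc,
       ars.operational_state_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_groups AS ag
  ON ag.group_id = ars.group_id
WHERE ars.is_local = 1;

-- Per-database synchronization state on this replica.
SELECT DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc,
       drs.is_suspended
FROM sys.dm_hadr_database_replica_states AS drs
WHERE drs.is_local = 1;
```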

 

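4: Additional Notes

If the nodes that were shut down during the test (XXX.112 and XXX.113) are restarted at this point, the databases deployed on them show Not Synchronizing.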
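After a forced failover, data movement on the secondary databases is typically suspended until it is resumed by hand, which is one likely reason for the Not Synchronizing state. The sketch below is not part of the original test: `[TestDB]` is a hypothetical database name, and the statements are assumed to be run on each restarted secondary once it has rejoined the cluster.

```sql
-- Check whether the local databases are suspended, and why.
SELECT DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc,
       drs.is_suspended,
       drs.suspend_reason_desc
FROM sys.dm_hadr_database_replica_states AS drs
WHERE drs.is_local = 1;

-- Resume data movement for one database ([TestDB] is a placeholder name).
ALTER DATABASE [TestDB] SET HADR RESUME;
```

Resuming brings the secondary in line with the new primary, so any data that existed only on the old primary should be salvaged before this statement is run if it matters.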

 

  

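The copyright of this article belongs to the author. Please do not reproduce it without the author's permission. Thank you for your cooperation!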


Origin www.cnblogs.com/xuliuzai/p/11069279.html