RAC Node Eviction Analysis

This test simulates a failure of the private interconnect NIC and analyzes the RAC node eviction that follows.

For reference, see the note on the top five issues that cause instance eviction (Doc ID 1526186.1).

Checking cluster resources

[qdtais1]@ht01[/home/oracle]$crsctl status res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       ht01                                         
               ONLINE  ONLINE       ht02                                         
ora.LISTENER.lsnr
               ONLINE  ONLINE       ht01                                         
               ONLINE  ONLINE       ht02                                         
ora.OCR.dg
               ONLINE  ONLINE       ht01                                         
               ONLINE  ONLINE       ht02                                         
ora.asm
               ONLINE  ONLINE       ht01                     Started             
               ONLINE  ONLINE       ht02                     Started             
ora.gsd
               OFFLINE OFFLINE      ht01                                         
               OFFLINE OFFLINE      ht02                                         
ora.net1.network
               ONLINE  ONLINE       ht01                                         
               ONLINE  ONLINE       ht02                                         
ora.ons
               ONLINE  ONLINE       ht01                                         
               ONLINE  ONLINE       ht02                                         
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       ht01                                         
ora.cvu
      1        ONLINE  ONLINE       ht01                                         
ora.ht01.vip
      1        ONLINE  ONLINE       ht01                                         
ora.ht02.vip
      1        ONLINE  ONLINE       ht02                                         
ora.oc4j
      1        ONLINE  ONLINE       ht01                                         
ora.qdtais.db
      1        ONLINE  ONLINE       ht01                     Open                
      2        ONLINE  ONLINE       ht02                     Open                
ora.scan1.vip
      1        ONLINE  ONLINE       ht01                                         
ora.yz.db
      1        OFFLINE OFFLINE                               Instance Shutdown 
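
Before injecting the fault it helps to record which resources are already not ONLINE, so they are not later mistaken for a consequence of the eviction. The following is a minimal sketch, assuming the tabular layout of `crsctl status res -t` shown above; the function name `resources_not_online` is hypothetical and not part of any Oracle tooling.

```python
def resources_not_online(text):
    """Return resource names that have at least one row whose STATE is not ONLINE.

    Assumes the layout shown above: a line starting with "ora." names the
    resource, and the following rows hold TARGET and STATE, optionally
    preceded by an instance number for cluster resources.
    """
    found = []
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("ora."):
            current = fields[0]
            continue
        # Cluster resources carry a leading instance number; drop it.
        if fields[0].isdigit():
            fields = fields[1:]
        # A state row starts with the TARGET column, followed by STATE.
        if current and len(fields) >= 2 and fields[0] in ("ONLINE", "OFFLINE"):
            if fields[1] != "ONLINE" and current not in found:
                found.append(current)
    return found


sample = """\
ora.gsd
               OFFLINE OFFLINE      ht01
               OFFLINE OFFLINE      ht02
ora.ons
               ONLINE  ONLINE       ht01
ora.yz.db
      1        OFFLINE OFFLINE                               Instance Shutdown
"""
print(resources_not_online(sample))
```

In the output above, ora.gsd and ora.yz.db were already OFFLINE before the test, so they are unrelated to the NIC failure.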

Checking the hosts file and NIC information

[qdtais1]@ht01[/home/oracle]$ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:D0:2C:DC  
          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fed0:2cdc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:60 errors:0 dropped:0 overruns:0 frame:0
          TX packets:154 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:7317 (7.1 KiB)  TX bytes:20671 (20.1 KiB)

eth1      Link encap:Ethernet  HWaddr 08:00:27:D7:4E:75  
          inet addr:192.168.20.200  Bcast:192.168.20.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fed7:4e75/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:7909 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6555 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:912161 (890.7 KiB)  TX bytes:712119 (695.4 KiB)

eth1:1    Link encap:Ethernet  HWaddr 08:00:27:D7:4E:75  
          inet addr:192.168.20.204  Bcast:192.168.20.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth1:3    Link encap:Ethernet  HWaddr 08:00:27:D7:4E:75  
          inet addr:192.168.20.202  Bcast:192.168.20.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth2      Link encap:Ethernet  HWaddr 08:00:27:BB:03:40  
          inet addr:192.168.0.10  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:febb:340/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1407822 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1092372 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1046688365 (998.1 MiB)  TX bytes:606254225 (578.1 MiB)

eth2:1    Link encap:Ethernet  HWaddr 08:00:27:BB:03:40  
          inet addr:169.254.67.75  Bcast:169.254.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:265652 errors:0 dropped:0 overruns:0 frame:0
          TX packets:265652 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:143272867 (136.6 MiB)  TX bytes:143272867 (136.6 MiB)

[qdtais1]@ht01[/home/oracle]$cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.20.200 ht01
192.168.20.201 ht02
192.168.0.10 ht01-priv1
192.168.0.20 ht02-priv1
192.168.20.202 ht01-vip
192.168.20.203 ht02-vip
192.168.20.204 ht-scanip
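
The address on eth2:1 (169.254.67.75) does not appear in /etc/hosts. It falls in the link-local range 169.254.0.0/16, which is consistent with an address assigned automatically on the private interface rather than configured by hand. The sketch below cross-checks the addresses from the `ifconfig` output against the hosts entries shown above; the helper name `describe` is hypothetical.

```python
import ipaddress

# Entries copied from the /etc/hosts file shown above.
HOSTS = {
    "192.168.20.200": "ht01",
    "192.168.20.201": "ht02",
    "192.168.0.10": "ht01-priv1",
    "192.168.0.20": "ht02-priv1",
    "192.168.20.202": "ht01-vip",
    "192.168.20.203": "ht02-vip",
    "192.168.20.204": "ht-scanip",
}


def describe(addr):
    """Label an address as link-local, a known hosts entry, or unlisted."""
    if ipaddress.ip_address(addr).is_link_local:
        return "link-local (169.254.0.0/16)"
    return HOSTS.get(addr, "unlisted")


for addr in ("10.0.2.15", "192.168.0.10", "169.254.67.75"):
    print(addr, "->", describe(addr))
```

This link-local address is the one that the alert log on node 1 later reports as DOWN.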

Bringing down eth2, the private heartbeat NIC on node 1

[root@ht01 ~]# ifconfig  eth2 down

Checking NIC information (eth2 and eth2:1 no longer show the UP and RUNNING flags)

[root@ht01 ~]# ifconfig -a
eth0      Link encap:Ethernet  HWaddr 08:00:27:D0:2C:DC  
          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fed0:2cdc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:884 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1410 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:65160 (63.6 KiB)  TX bytes:869140 (848.7 KiB)

eth1      Link encap:Ethernet  HWaddr 08:00:27:D7:4E:75  
          inet addr:192.168.20.200  Bcast:192.168.20.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fed7:4e75/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:8729 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7292 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:983762 (960.7 KiB)  TX bytes:817872 (798.7 KiB)

eth1:1    Link encap:Ethernet  HWaddr 08:00:27:D7:4E:75  
          inet addr:192.168.20.204  Bcast:192.168.20.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth1:2    Link encap:Ethernet  HWaddr 08:00:27:D7:4E:75  
          inet addr:192.168.20.203  Bcast:192.168.20.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth1:3    Link encap:Ethernet  HWaddr 08:00:27:D7:4E:75  
          inet addr:192.168.20.202  Bcast:192.168.20.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth2      Link encap:Ethernet  HWaddr 08:00:27:BB:03:40  
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:1414086 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1097177 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1051368691 (1002.6 MiB)  TX bytes:608947879 (580.7 MiB)

eth2:1    Link encap:Ethernet  HWaddr 08:00:27:BB:03:40  
          inet addr:169.254.67.75  Bcast:169.254.255.255  Mask:255.255.0.0
          BROADCAST MULTICAST  MTU:1500  Metric:1

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:267864 errors:0 dropped:0 overruns:0 frame:0
          TX packets:267864 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:144385365 (137.6 MiB)  TX bytes:144385365 (137.6 MiB)
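
The difference between the two `ifconfig` listings is easy to miss by eye: the flag line of eth2 and eth2:1 has lost UP and RUNNING. A minimal sketch that picks out such interfaces, assuming the older net-tools output format shown above (the function name `down_interfaces` is hypothetical):

```python
def down_interfaces(text):
    """Return interface names whose flag line does not contain UP."""
    down = []
    name = None
    for line in text.splitlines():
        if line and not line[0].isspace():
            # A new stanza starts at column 0 with the interface name.
            name = line.split()[0]
        elif "MTU:" in line and name:
            # Flags are the words before "MTU:" on the flag line.
            flags = line.split("MTU:")[0].split()
            if "UP" not in flags:
                down.append(name)
    return down


sample = """\
eth1      Link encap:Ethernet  HWaddr 08:00:27:D7:4E:75
          inet addr:192.168.20.200  Bcast:192.168.20.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth2      Link encap:Ethernet  HWaddr 08:00:27:BB:03:40
          BROADCAST MULTICAST  MTU:1500  Metric:1

eth2:1    Link encap:Ethernet  HWaddr 08:00:27:BB:03:40
          inet addr:169.254.67.75  Bcast:169.254.255.255  Mask:255.255.0.0
          BROADCAST MULTICAST  MTU:1500  Metric:1
"""
print(down_interfaces(sample))
```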

Log analysis

Node 1 Oracle alert log

Thu Mar 26 10:55:29 2020
SKGXP: ospid 4149: network interface with IP address 169.254.67.75 no longer running (check cable)  --- the private IP address is no longer running
SKGXP: ospid 4149: network interface with IP address 169.254.67.75 is DOWN
Thu Mar 26 10:55:47 2020
Reconfiguration started (old inc 4, new inc 6)                    --- reconfiguration starts; resources are redistributed
List of instances:
 1 (myinst: 1) 
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE 
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Thu Mar 26 10:55:47 2020
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info 
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Thu Mar 26 10:55:47 2020
minact-scn: Inst 1 is now the master inc#:6 mmon proc-id:4198 status:0x7   -- Inst 1 is now the master
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.0014ed0a gcalc-scn:0x0000.0014ed15
minact-scn: master found reconf/inst-rec before recscn scan old-inc#:6 new-inc#:6
Thu Mar 26 10:55:47 2020
Instance recovery: looking for dead threads
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete
Beginning instance recovery of 1 threads            -- the surviving instance starts recovering node 2's redo (thread 2)
Started redo scan
Completed redo scan
 read 0 KB redo, 0 data blocks need recovery
Started redo application at
 Thread 2: logseq 13, block 47971, scn 1371433
Recovery of Online Redo Log: Thread 2 Group 3 Seq 13 Reading mem 0
  Mem# 0: +DATA/qdtais/onlinelog/group_3.268.1023987437
  Mem# 1: +DATA/qdtais/onlinelog/group_3.269.1023987441
Completed redo application of 0.00MB
Completed instance recovery at                         -- redo recovery complete
 Thread 2: logseq 13, block 47971, scn 1391434
 0 data blocks read, 0 data blocks written, 0 redo k-bytes read
Thread 2 advanced to log sequence 14 (thread recovery)
minact-scn: master continuing after IR
Thu Mar 26 10:56:47 2020
Decreasing number of real time LMS from 1 to 0
Thu Mar 26 11:01:51 2020
db_recovery_file_dest_size of 4407 MB is 5.08% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.

Node 1 grid log

2020-03-26 10:55:29.454: 
[cssd(3278)]CRS-1612:Network communication with node ht02 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.180 seconds   --- network communication with node 2 is timing out
2020-03-26 10:55:36.456: 
[cssd(3278)]CRS-1611:Network communication with node ht02 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 7.180 seconds
2020-03-26 10:55:41.458: 
[cssd(3278)]CRS-1610:Network communication with node ht02 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.170 seconds
2020-03-26 10:55:43.636: 
[cssd(3278)]CRS-1607:Node ht02 is being evicted in cluster incarnation 480633263; details at (:CSSNM00007:) in /u01/app/grid/log/ht01/cssd/ocssd.log.          --- node 2 is evicted from the cluster
2020-03-26 10:55:45.815: 
[cssd(3278)]CRS-1625:Node ht02, number 2, was manually shut down      -- the cluster stack on node 2 has been shut down
2020-03-26 10:55:45.821: 
[cssd(3278)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ht01 . -- CSSD finishes reconfiguring cluster membership; ht01 is the only active node
2020-03-26 10:55:45.834: 
[ctssd(3421)]CRS-2407:The new Cluster Time Synchronization Service reference node is host ht01.
2020-03-26 10:55:57.079: 
[crsd(3564)]CRS-5504:Node down event reported for node 'ht02'.
2020-03-26 10:56:00.027: 
[crsd(3564)]CRS-2773:Server 'ht02' has been removed from pool 'Generic'.
2020-03-26 10:56:00.033: 
[crsd(3564)]CRS-2773:Server 'ht02' has been removed from pool 'ora.qdtais'.
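
Each of the CRS-1612, CRS-1611 and CRS-1610 warnings states how many seconds remain before removal, so adding that value to the message timestamp gives the predicted eviction time. The sketch below works this out for the three warnings above and compares it with the CRS-1607 eviction logged at 10:55:43.636. The helper name `predicted_removal` is hypothetical, and the regular expression assumes the exact message wording shown in this log.

```python
import re
from datetime import datetime, timedelta

PATTERN = re.compile(
    r"missing for (\d+)% of timeout interval\.\s+"
    r"Removal of this node from cluster in ([\d.]+) seconds"
)


def predicted_removal(timestamp, message):
    """Return the removal time implied by one CSSD warning, or None."""
    match = PATTERN.search(message)
    if match is None:
        return None
    # Timestamps in this log end with ": ", which is stripped before parsing.
    logged = datetime.strptime(timestamp.rstrip(": "), "%Y-%m-%d %H:%M:%S.%f")
    return logged + timedelta(seconds=float(match.group(2)))


warnings = [
    ("2020-03-26 10:55:29.454: ", "missing for 50% of timeout interval.  Removal of this node from cluster in 14.180 seconds"),
    ("2020-03-26 10:55:36.456: ", "missing for 75% of timeout interval.  Removal of this node from cluster in 7.180 seconds"),
    ("2020-03-26 10:55:41.458: ", "missing for 90% of timeout interval.  Removal of this node from cluster in 2.170 seconds"),
]
for stamp, text in warnings:
    print(predicted_removal(stamp, text).time())
```

All three predictions land within a few milliseconds of 10:55:43.63, which matches the time at which node 1 reports that ht02 is being evicted.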

Node 2 grid log

2020-03-26 10:55:28.379: 
[cssd(3208)]CRS-1612:Network communication with node ht01 (1) missing for 50% of timeout interval.  Removal of this node from cluster in 14.800 seconds     --- network communication with node 1 is timing out
2020-03-26 10:55:36.384: 
[cssd(3208)]CRS-1611:Network communication with node ht01 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 6.790 seconds
2020-03-26 10:55:40.385: 
[cssd(3208)]CRS-1610:Network communication with node ht01 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.790 seconds
2020-03-26 10:55:43.180: 
[cssd(3208)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/grid/log/ht02/cssd/ocssd.log.
2020-03-26 10:55:43.180: 
[cssd(3208)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/grid/log/ht02/cssd/ocssd.log  -- the CSS daemon is forcibly terminated
2020-03-26 10:55:43.222: 
[cssd(3208)]CRS-1652:Starting clean up of CRSD resources.   -- CRSD resources are cleaned up
2020-03-26 10:55:44.259: 
[cssd(3208)]CRS-1608:This node was evicted by node 1, ht01; details at (:CSSNM00005:) in /u01/app/grid/log/ht02/cssd/ocssd.log.
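
Note that the NIC was brought down on ht01, yet ht02 is the node that was evicted (CRS-1608: "evicted by node 1"). Once the interconnect fails, neither node can reach the other, so each is a one-node subcluster and one of them has to go. The sketch below models a tie-break rule commonly described for this situation: the larger subcluster survives, and on a tie the subcluster containing the lowest node number survives. This rule is an assumption used for illustration. It is not stated in the logs above and may differ between Grid Infrastructure versions. The function name `surviving_subcluster` is hypothetical.

```python
def surviving_subcluster(subclusters):
    """Pick the surviving subcluster under the assumed tie-break rule.

    subclusters: list of lists of node numbers, one list per group of
    nodes that can still reach each other.
    Assumed rule: the largest group wins; on a tie, the group holding
    the lowest node number wins.
    """
    return max(subclusters, key=lambda nodes: (len(nodes), -min(nodes)))


# Two-node cluster after the interconnect failure: node 1 (ht01) and
# node 2 (ht02) each form a subcluster of size one.
print(surviving_subcluster([[1], [2]]))
```

Under that assumption node 1 survives the split, which is consistent with what both grid logs report.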

Node 2 Oracle alert log

Thu Mar 26 10:55:45 2020
NOTE: ASMB terminating    -- the ASMB process terminates, which crashes the database instance
Errors in file /u01/app/db/diag/rdbms/qdtais/qdtais2/trace/qdtais2_asmb_3974.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 32 Serial number: 3
Errors in file /u01/app/db/diag/rdbms/qdtais/qdtais2/trace/qdtais2_asmb_3974.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 32 Serial number: 3
ASMB (ospid: 3974): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 3974
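
The same two errors appear twice in this excerpt. When scanning a longer alert log, a small helper that lists each distinct ORA- code in order of first appearance makes the failure chain easier to read. This is a minimal sketch; the function name `ora_errors` is hypothetical, and it assumes five-digit ORA- codes as shown above.

```python
import re


def ora_errors(text):
    """Return distinct ORA-nnnnn codes in order of first appearance."""
    seen = []
    for code in re.findall(r"ORA-\d{5}", text):
        if code not in seen:
            seen.append(code)
    return seen


excerpt = """\
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
ASMB (ospid: 3974): terminating the instance due to error 15064
"""
print(ora_errors(excerpt))
```

Here the database instance on node 2 loses its connection to ASM (ORA-15064) at 10:55:45, about two seconds after the CSS daemon on that node terminated at 10:55:43.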

  

 

  

 

  

 

Reposted from www.cnblogs.com/omsql/p/12577374.html