Oracle 11.2.0.4 RAC for Linux: single-node crash caused by a link failure

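    On May 24, 2018, a customer reported that one of their Oracle 11.2.0.4 RAC for Linux clusters had crashed unexpectedly. The ASM alert log showed disk I/O errors, and the customer asked for help with the analysis.
Customer environment:
Operating system: RHEL 6.9 x86_64
Database version: Oracle RAC 11.2.0.4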

Customer's question: why did the link failure bring down node 2 while node 1 was unaffected?

ASM error messages provided by the customer:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Handle 0x7fffe009f230 from lib :UFS:: for disk :/dev/mapper/ssdredo_1:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Handle 0x7fffe009fe90 from lib :UFS:: for disk :/dev/mapper/data_1:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512))
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe00993b0 for disk :/dev/mapper/data_2:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe0099d50 for disk :/dev/mapper/data_3:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009a970 for disk :/dev/mapper/fra_1:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009b590 for disk :/dev/mapper/ocr_1:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009c1b0 for disk :/dev/mapper/ocr_2:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009cdd0 for disk :/dev/mapper/ocr_3:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512))
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009d9f0 for disk :/dev/mapper/ssddata_1:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512))
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009e610 for disk :/dev/mapper/ssddata_2:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )
2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe009f230 for disk :/dev/mapper/ssdredo_1:
2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512))
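To see at a glance which devices returned errors, the SKGFD trace above can be reduced to a list of disks. The following is a minimal Python sketch, not a supported Oracle tool. It assumes each `ERROR` line belongs to the disk named on the line immediately before it, which is how the excerpt appears to be laid out; the function name `disks_with_io_errors` and the sample list are illustrative only.

```python
import re

# "for disk :/dev/mapper/data_1:" -> captures /dev/mapper/data_1
DISK_RE = re.compile(r"for disk :(?P<disk>[^:\s]+):")
# "Error 27061, OS Error (Linux-x86_64 Error: 5:" -> captures 27061 and 5
ERR_RE = re.compile(
    r"Error (?P<ora>\d+), OS Error \(Linux-x86_64 Error: (?P<errno>\d+):"
)


def disks_with_io_errors(lines):
    """Return disks whose trace line is directly followed by an OS error line.

    Assumption: an ERROR line refers to the most recent "for disk" line.
    """
    failed = []
    last_disk = None
    for line in lines:
        disk_match = DISK_RE.search(line)
        if disk_match:
            last_disk = disk_match.group("disk")
            continue
        if ERR_RE.search(line) and last_disk is not None:
            if last_disk not in failed:
                failed.append(last_disk)
            last_disk = None  # one error line per disk line
    return failed


# A few lines copied from the trace above.
sample_trace = [
    "2018-05-23 16:21:16.296: [   SKGFD][4072036096]Handle 0x7fffe009f230 from lib :UFS:: for disk :/dev/mapper/ssdredo_1:",
    "2018-05-23 16:21:16.296: [   SKGFD][4072036096]Handle 0x7fffe009fe90 from lib :UFS:: for disk :/dev/mapper/data_1:",
    "2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512))",
    "2018-05-23 16:21:16.296: [   SKGFD][4072036096]Lib :UFS:: closing handle 0x7fffe00993b0 for disk :/dev/mapper/data_2:",
    "2018-05-23 16:21:16.296: [   SKGFD][4072036096]ERROR: -9(Error 27061, OS Error (Linux-x86_64 Error: 5: Input/output error Additional information: -1 Additional information: 512) )",
]

failed_disks = disks_with_io_errors(sample_trace)
print(failed_disks)
```

Run against the full trace, this kind of summary makes it easy to confirm that every multipath device used by ASM (data, fra, ocr, ssddata, ssdredo) failed at the same timestamp, which points to a shared path problem rather than a single bad disk.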
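    The analysis of the single-node Oracle RAC crash reported by the customer follows.
    Operating system log on node 2 leading up to the failure:
1. May 23 16:17:45: the operating system on node 2 detected that the link had gone down.
2. May 23 16:17:49: the operating system log on node 2 recorded a kernel I/O request error on device sdo, sector 327698.
3. May 23 16:17:50: the operating system log on node 2 recorded a multipath mapping failure.
4. May 23 16:17:51: the operating system log on node 2 shows multipathd reporting that it could not complete the flush operation: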
May 23 16:17:51 oadb2 multipathd: ocr_1: map in use
May 23 16:17:51 oadb2 multipathd: ocr_1: can't flush
May 23 16:17:51 oadb2 multipathd: ocr_1: load table [0 2097152 multipath 1 queue_if_no_path 0 0 0]
May 23 16:17:51 oadb2 multipathd: ocr_1: Entering recovery mode: max_retries=6
May 23 16:17:51 oadb2 multipathd: sdo [8:224]: path removed from map ocr_1
May 23 16:17:51 oadb2 multipathd: sdp: remove path (uevent)
May 23 16:17:51 oadb2 multipathd: ocr_2: map in use
May 23 16:17:51 oadb2 multipathd: ocr_2: can't flush
May 23 16:17:51 oadb2 multipathd: ocr_2: load table [0 2097152 multipath 1 queue_if_no_path 0 0 0]
May 23 16:17:51 oadb2 multipathd: ocr_2: Entering recovery mode: max_retries=6
May 23 16:17:51 oadb2 multipathd: sdp [8:240]: path removed from map ocr_2
May 23 16:17:51 oadb2 multipathd: sdq: remove path (uevent)
May 23 16:17:51 oadb2 multipathd: ocr_3: map in use
May 23 16:17:51 oadb2 multipathd: ocr_3: can't flush
May 23 16:17:51 oadb2 multipathd: ocr_3: load table [0 2097152 multipath 1 queue_if_no_path 0 0 0]
May 23 16:17:51 oadb2 multipathd: ocr_3: Entering recovery mode: max_retries=6
May 23 16:18:20 oadb2 multipathd: ocr_1: Disable queueing
May 23 16:18:20 oadb2 multipathd: ocr_2: Disable queueing
May 23 16:18:21 oadb2 multipathd: ocr_3: Disable queueing
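The key detail in the multipathd excerpt is how long each map kept queueing I/O before giving up. The sketch below, written in Python under a few assumptions, measures the time between "Entering recovery mode" and "Disable queueing" for each map. The function name `queueing_windows` is hypothetical, and the year is passed in explicitly because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Matches lines such as:
#   May 23 16:17:51 oadb2 multipathd: ocr_1: Entering recovery mode: max_retries=6
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ multipathd: "
    r"(?P<map>\S+): (?P<msg>.*)$"
)


def queueing_windows(lines, year=2018):
    """Seconds between 'Entering recovery mode' and 'Disable queueing' per map."""
    started = {}
    windows = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # e.g. "sdo [8:224]: path removed ..." has no map prefix
        ts = datetime.strptime(
            f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S"
        )
        name, msg = match.group("map"), match.group("msg")
        if msg.startswith("Entering recovery mode"):
            started[name] = ts
        elif msg.startswith("Disable queueing") and name in started:
            windows[name] = (ts - started[name]).total_seconds()
    return windows


# Lines copied from the node 2 log above.
sample_log = [
    "May 23 16:17:51 oadb2 multipathd: ocr_1: can't flush",
    "May 23 16:17:51 oadb2 multipathd: ocr_1: Entering recovery mode: max_retries=6",
    "May 23 16:17:51 oadb2 multipathd: sdo [8:224]: path removed from map ocr_1",
    "May 23 16:17:51 oadb2 multipathd: ocr_3: Entering recovery mode: max_retries=6",
    "May 23 16:18:20 oadb2 multipathd: ocr_1: Disable queueing",
    "May 23 16:18:21 oadb2 multipathd: ocr_3: Disable queueing",
]

windows = queueing_windows(sample_log)
print(windows)
```

For the lines shown, the window is about 29 to 30 seconds per map. That would be consistent with six retries at a 5-second path-check interval, but the polling interval configured on this host is not shown in the source, so treat that reading as an assumption.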
7. May 23 16:21: the ASM log on node 2 shows that the instance was terminated by the PMON process after ASM's repeated attempts to mount the DATA disk group failed:
Wed May 23 16:18:52 2018
ERROR: no read quorum in group: required 2, found 0 disks
NOTE: cache dismounting (clean) group 1/0x5ABDEE06 (DATA) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 29399, image: oracle@soadb2 (TNS V1-V3)
NOTE: dbwr not being msg'd to dismount
NOTE: lgwr not being msg'd to dismount
NOTE: cache dismounted group 1/0x5ABDEE06 (DATA) 
NOTE: cache ending mount (fail) of group DATA number=1 incarn=0x5abdee06
NOTE: cache deleting context for group DATA 1/0x5abdee06
GMON dismounting group 1 at 37 for pid 29, osid 29399
ERROR: diskgroup DATA was not mounted
ORA-15032: not all alterations performed
ORA-15017: diskgroup "DATA" cannot be mounted
ORA-15040: diskgroup is incomplete
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
Additional information: 4096
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: -1
ERROR: ALTER DISKGROUP DATA MOUNT /* asm agent *//* {0:7:10794} */
Wed May 23 16:21:07 2018
NOTE: ASMB process exiting, either shutdown is in progress 
NOTE: or foreground connected to ASMB was killed. 
Wed May 23 16:21:07 2018
PMON (ospid: 28806): terminating the instance due to error 481
Wed May 23 16:21:08 2018
ORA-1092 : opitsk aborting process
Wed May 23 16:21:08 2018
License high water mark = 12
Instance terminated by PMON, pid = 28806
USER (ospid: 52046): terminating the instance
Instance terminated by USER, pid = 52046
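When an alert log excerpt is this long, it can help to pull out just the distinct ORA- codes and the termination events. The following Python sketch does that; `summarize_alert` is an illustrative helper, not part of any Oracle tooling, and it pads codes to five digits only so that `ORA-1092` and `ORA-01092` are treated as the same error.

```python
import re

ORA_RE = re.compile(r"\bORA-(\d+)\b")
# Matches "PMON (ospid: 28806): terminating the instance due to error 481"
# and "USER (ospid: 52046): terminating the instance".
TERM_RE = re.compile(
    r"^(?P<who>\w+) \(ospid: (?P<pid>\d+)\): terminating the instance"
    r"(?: due to error (?P<err>\d+))?"
)


def summarize_alert(lines):
    """Return (distinct ORA codes in first-seen order, termination events)."""
    codes = []
    terminations = []
    for line in lines:
        for digits in ORA_RE.findall(line):
            label = f"ORA-{int(digits):05d}"  # normalize to five digits
            if label not in codes:
                codes.append(label)
        match = TERM_RE.match(line.strip())
        if match:
            terminations.append(
                (match.group("who"), int(match.group("pid")), match.group("err"))
            )
    return codes, terminations


# Lines copied from the node 2 ASM log above.
sample_alert = [
    "ORA-15032: not all alterations performed",
    'ORA-15017: diskgroup "DATA" cannot be mounted',
    "ORA-15040: diskgroup is incomplete",
    "ORA-27061: waiting for async I/Os failed",
    "Linux-x86_64 Error: 5: Input/output error",
    "ORA-27061: waiting for async I/Os failed",
    "PMON (ospid: 28806): terminating the instance due to error 481",
    "ORA-1092 : opitsk aborting process",
    "USER (ospid: 52046): terminating the instance",
]

codes, terminations = summarize_alert(sample_alert)
print(codes)
print(terminations)
```

The output keeps the order in which errors first appear, so the mount failure (ORA-15032, ORA-15017, ORA-15040, ORA-27061) is visibly ahead of the instance termination.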
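    Summary (OA system): on node 2, the multipathd log shows the OCR maps entering recovery mode with max_retries=6 at 16:17:51 and reaching "Disable queueing" at 16:18:20 to 16:18:21. Once queueing is disabled, I/O to a map with no usable path is failed instead of held, which matches the Input/output errors reported by ASM. In node 1's log, by contrast, the OCR disk group maps never reached "Disable queueing", so the DB and GI instances on node 1 were not affected.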

Reposted from blog.csdn.net/www_xue_xi/article/details/80786334