ASM diskgroup dismount with "Waited 15 secs for write IO to PST"
SYMPTOMS
A normal or high redundancy diskgroup is dismounted with the following WARNING messages.
Note - ASM alert.log
Sat Mar 07 05:03:10 2015
WARNING: Waited 15 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 18 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 18 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 21 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 21 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 24 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 24 secs for write IO to PST disk 1 in group 2.
Sat Mar 07 05:03:22 2015
WARNING: Waited 27 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 27 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 30 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 30 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 33 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 33 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 36 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 36 secs for write IO to PST disk 1 in group 2.
Sat Mar 07 05:03:34 2015
- The WARNING shown above is written to the ASM alert.log: "WARNING: Waited 15 secs for write IO to PST disk 1 in group 2." It means that a write I/O to the Partnership and Status Table (PST) on disk 1 of disk group 2 had already waited 15 seconds and was still outstanding; as the wait continues, the message is repeated and the reported wait time is incremented each time it fires. This situation is generally related to the operating system or storage communication path, to a problem with the database host's disks, or to the timeout settings in use. We continue with the ASM alert.log for further analysis.
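While the warnings are being written, the disk and group numbers from the messages can be cross-checked in the ASM instance. A minimal sketch, assuming the numbers from the example above (disk 1 in group 2); the output shows whether the disk is still CACHED/ONLINE or already heading towards OFFLINE:

select group_number, disk_number, name, path, mount_status, mode_status, state
from   v$asm_disk
where  group_number = 2
and    disk_number  = 1;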
Note - DiskGroup Dismounted
Mon Mar 09 16:32:11 2015
NOTE: process _b000_+asm1 (1051) initiating offline of disk 0.3915951733 (DATA_0000) with mask 0x7e in group 2
NOTE: process _b000_+asm1 (1051) initiating offline of disk 1.3915951732 (DATA_0001) with mask 0x7e in group 2
NOTE: checking PST: grp = 2
GMON checking disk modes for group 2 at 7 for pid 28, osid 1051
ERROR: no read quorum in group: required 2, found 1 disks
NOTE: checking PST for grp 2 done.
NOTE: initiating PST update: grp = 2, dsk = 0/0xe968ae75, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 2, dsk = 1/0xe968ae74, mask = 0x6a, op = clear
GMON updating disk modes for group 2 at 8 for pid 28, osid 1051
ERROR: no read quorum in group: required 2, found 1 disks
Mon Mar 09 16:32:11 2015
NOTE: cache dismounting (not clean) group 2/0xEF985E9D (DATA)
NOTE: messaging CKPT to quiesce pins Unix process pid: 1056, image: oracle@rac1 (B001)
Mon Mar 09 16:32:11 2015
NOTE: halting all I/Os to diskgroup 2 (DATA)
Mon Mar 09 16:32:11 2015
NOTE: LGWR doing non-clean dismount of group 2 (DATA)
NOTE: LGWR sync ABA=30.108 last written ABA 30.108
WARNING: Offline for disk DATA_0000 in mode 0x7f failed.
WARNING: Offline for disk DATA_0001 in mode 0x7f failed
- Disk 1 in disk group 2 was slow to respond, or hung, for some reason, which triggered the wait at the ASM level. By default, ASM only tolerates 15 seconds of disk unresponsiveness. Although in 11.2.0.4 the wait mechanism keeps extending the wait (as seen in the incrementing warnings above), the total time ASM will wait for the disk I/O is still limited. Once ASM concludes that a disk in the disk group is unresponsive, it attempts to OFFLINE the failed disk.
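The 15 second threshold mentioned above comes from the hidden parameter _asm_hbeatiowait (covered in the SOLUTION section). A sketch to display its current value, run as SYSASM/SYSDBA in the ASM instance; the x$ fixed tables are used because hidden parameters are not shown by SHOW PARAMETER:

select i.ksppinm  as parameter,
       c.ksppstvl as current_value,
       i.ksppdesc as description
from   x$ksppi i, x$ksppcv c
where  i.indx = c.indx
and    i.ksppinm = '_asm_hbeatiowait';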
Mon Mar 09 16:32:11 2015
kjbdomdet send to inst 2
detach from dom 2, sending detach message to inst 2
Mon Mar 09 16:32:11 2015
NOTE: No asm libraries found in the system
Mon Mar 09 16:32:11 2015
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 16)
ASM Health Checker found 1 new failures
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 2 invalid = TRUE
128 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
Mon Mar 09 16:32:11 2015
- At the same time, Oracle ASM reconfigures the cluster resources that reference the affected disk group and records the state of the Clusterware and ASM communication at that point. In the log above this appears as the "Dirty Detach Reconfiguration" messages. After that, Oracle will try to re-establish access to the affected disks and MOUNT the target disk group again to return to the original, normal state.
Mon Mar 09 16:32:27 2015
Received dirty detach msg from inst 2 for dom 2
Mon Mar 09 16:32:27 2015
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 2, cluster inc 16)
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 2 invalid = TRUE
128 GCS resources traversed, 0 cancelled
freeing rdom 2
Dirty Detach Reconfiguration complete
Mon Mar 09 16:32:41 2015
NOTE: Waiting for all pending writes to complete before de-registering: grpnum 2
Mon Mar 09 16:32:58 2015
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15079: ASM file is closed
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15079: ASM file is closed
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15079: ASM file is closed
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15079: ASM file is closed
Mon Mar 09 16:33:11 2015
SUCCESS: diskgroup DATA was dismounted
SUCCESS: alter diskgroup DATA dismount force /* ASM SERVER:4019740317 */
Mon Mar 09 16:33:11 2015
NOTE: diskgroup resource ora.DATA.dg is offline
SUCCESS: ASM-initiated MANDATORY DISMOUNT of group DATA
Mon Mar 09 16:33:11 2015
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
WARNING: requested mirror side 1 of virtual extent 5 logical extent 0 offset 724992 is not allocated; I/O request failed
WARNING: requested mirror side 2 of virtual extent 5 logical extent 1 offset 724992 is not allocated; I/O request failed
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14247.trc:
ORA-15078: ASM diskgroup was forcibly dismounted
ORA-15078: ASM diskgroup was forcibly dismounted
Mon Mar 09 16:33:11 2015
SQL> alter diskgroup DATA check /* proxy */
ORA-15032: not all alterations performed
ORA-15001: diskgroup "DATA" does not exist or is not mounted
ERROR: alter diskgroup DATA check /* proxy */
NOTE: client exited [14233]
Mon Mar 09 16:33:16 2015
NOTE: [crsd.bin@rac1 (TNS V1-V3) 1581] opening OCR file
CAUSE
- Delayed ASM PST heartbeats on ASM disks in a normal or high redundancy diskgroup cause the ASM instance to dismount the diskgroup. By default, the heartbeat wait threshold is 15 seconds.
- Heartbeat delays are, to an extent, tolerated for an external redundancy diskgroup: the ASM instance stops issuing further PST heartbeats until a PST revalidation succeeds, so heartbeat delays do not directly dismount an external redundancy diskgroup.
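Since only normal and high redundancy diskgroups are dismounted by this mechanism, it can be useful to confirm the redundancy type and state of the diskgroups involved, for example:

select group_number, name, type, state, offline_disks
from   v$asm_diskgroup;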
- Typical situations that can delay the PST heartbeat include:
+ Some of the physical paths of the multipath device are offline or lost
+ A path failover is in progress in a multipath set up
+ Server load, or any sort of storage/multipath/OS maintenance
Doc ID 10109915.8 describes Bug 10109915 (the fix for this bug introduced the underscore parameter). The underlying issue is that there is no OS/storage tunable timeout mechanism in the case of a hung NFS server/filer, and _asm_hbeatiowait helps by making the timeout configurable.
SOLUTION
- 1] Check with the OS and storage administrators whether there is any disk unresponsiveness; an ASM-side cross-check is sketched below.
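In addition to the OS/storage checks, the cumulative I/O statistics in V$ASM_DISK can give a rough, relative indication of slow disks from the ASM side. A sketch only; the expressions divide the cumulative read/write times reported by the view by the number of I/Os to get an average per-I/O figure:

select group_number, disk_number, path,
       round(read_time  / nullif(reads, 0),  4) as avg_read_time,
       round(write_time / nullif(writes, 0), 4) as avg_write_time
from   v$asm_disk
order  by avg_write_time desc nulls last;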
- 2] If possible, keep the disk response time below 15 seconds.
This will depend on various factors like
+ Operating System
+ Presence of Multipath ( and Multipath Type )
+ Any kernel parameter
- So you need to find out what the maximum possible disk unresponsiveness is for your set up. For example, on AIX the rw_timeout setting affects this and defaults to 30 seconds.
Another example is Linux with native multipathing. In such a set up, the number of physical paths and the polling_interval value in the multipath.conf file dictate this maximum disk unresponsiveness.
So for your set up (the combination of OS / multipath / storage), you need to determine this value.
- 3] If you cannot keep the disk unresponsiveness below 15 seconds, then the parameter below can be set in the ASM instance (on all nodes of the RAC):
_asm_hbeatiowait
- As per internal bug 17274537, based on internal testing the value should be increased to 120 secs; the same will be fixed in 12.2.
Run the following in the ASM instance to set the desired value for _asm_hbeatiowait:
alter system set "_asm_hbeatiowait"=<value> scope=spfile sid='*';
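Because scope=spfile is used, the new value can be confirmed in the spfile before restarting, for example:

select sid, name, value
from   v$spparameter
where  name = '_asm_hbeatiowait';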
Then restart the ASM instance / CRS for the new parameter value to take effect.
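If the diskgroup is still dismounted once the underlying disk problem has been resolved and the restart is complete, it can be mounted again and its state verified. A sketch, using the diskgroup name from the example above:

alter diskgroup DATA mount;

select name, state, type from v$asm_diskgroup;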