ASM _asm_hbeatiowait

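Recently the clusterware on one database node went down, but the database kept running normally and the node was not evicted from the cluster. Below is what ASM logged.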

 

WARNING: Waited 391 secs for write IO to PST disk 0 in group 2.

WARNING: Waited 391 secs for write IO to PST disk 0 in group 2.

Fri Aug 31 09:49:27 2018

WARNING: Waited 120 secs for write IO to PST disk 0 in group 2.

WARNING: Waited 120 secs for write IO to PST disk 1 in group 2.

WARNING: Waited 120 secs for write IO to PST disk 0 in group 2.

WARNING: Waited 120 secs for write IO to PST disk 1 in group 2.

Fri Aug 31 09:49:27 2018

NOTE: process _b000_+asm2 (109914) initiating offline of disk 0.3622141768 (OCR_VOTE_0000) with mask 0x7e in group 2

NOTE: process _b000_+asm2 (109914) initiating offline of disk 1.3622141767 (OCR_VOTE_0001) with mask 0x7e in group 2

NOTE: checking PST: grp = 2

GMON checking disk modes for group 2 at 21 for pid 35, osid 109914

ERROR: no read quorum in group: required 2, found 1 disks

NOTE: checking PST for grp 2 done.

NOTE: initiating PST update: grp = 2, dsk = 0/0xd7e57f48, mask = 0x6a, op = clear

NOTE: initiating PST update: grp = 2, dsk = 1/0xd7e57f47, mask = 0x6a, op = clear

GMON updating disk modes for group 2 at 22 for pid 35, osid 109914

ERROR: no read quorum in group: required 2, found 1 disks

Fri Aug 31 09:49:27 2018

NOTE: cache dismounting (not clean) group 2/0x43F58FEE (OCR_VOTE)

WARNING: Offline for disk OCR_VOTE_0000 in mode 0x7f failed.

WARNING: Offline for disk OCR_VOTE_0001 in mode 0x7f failed.

NOTE: messaging CKPT to quiesce pins Unix process pid: 109930, image: oracle@zjhzbjwgzhzg02 (B001)

Fri Aug 31 09:49:27 2018

NOTE: halting all I/Os to diskgroup 2 (OCR_VOTE)

Fri Aug 31 09:49:27 2018

NOTE: LGWR doing non-clean dismount of group 2 (OCR_VOTE)

NOTE: LGWR sync ABA=24.75 last written ABA 24.75

Fri Aug 31 09:49:27 2018

kjbdomdet send to inst 1

detach from dom 2, sending detach message to inst 1

Fri Aug 31 09:49:27 2018

NOTE: No asm libraries found in the system

Fri Aug 31 09:49:27 2018

List of instances:

 1 2

Dirty detach reconfiguration started (new ddet inc 1, cluster inc 28)

 Global Resource Directory partially frozen for dirty detach

* dirty detach - domain 2 invalid = TRUE

 130 GCS resources traversed, 0 cancelled

Dirty Detach Reconfiguration complete

Fri Aug 31 09:49:27 2018

WARNING: dirty detached from domain 2

NOTE: cache dismounted group 2/0x43F58FEE (OCR_VOTE)

SQL> alter diskgroup OCR_VOTE dismount force /* ASM SERVER:1140166638 */

Fri Aug 31 09:49:27 2018

NOTE: cache deleting context for group OCR_VOTE 2/0x43f58fee

GMON dismounting group 2 at 23 for pid 36, osid 109930

NOTE: Disk OCR_VOTE_0000 in mode 0x7f marked for de-assignment

NOTE: Disk OCR_VOTE_0001 in mode 0x7f marked for de-assignment

NOTE: Disk OCR_VOTE_0002 in mode 0x7f marked for de-assignment

NOTE:Waiting for all pending writes to complete before de-registering: grpnum 2

Fri Aug 31 09:49:58 2018

NOTE:Waiting for all pending writes to complete before de-registering: grpnum 2

Fri Aug 31 09:50:28 2018

NOTE:Waiting for all pending writes to complete before de-registering: grpnum 2

Fri Aug 31 09:50:53 2018

SQL> ALTER DISKGROUP OCR_VOTE MOUNT  /* asm agent *//* {0:1:6} */

WARNING: Disk Group OCR_VOTE containing spfile for this instance is not mounted

WARNING: Disk Group OCR_VOTE containing configured OCR is not mounted

WARNING: Disk Group OCR_VOTE containing voting files is not mounted

ORA-15032: not all alterations performed

ORA-15017: diskgroup "OCR_VOTE" cannot be mounted

ORA-15013: diskgroup "OCR_VOTE" is already mounted

ERROR: ALTER DISKGROUP OCR_VOTE MOUNT  /* asm agent *//* {0:1:6} */

Fri Aug 31 09:50:58 2018

NOTE:Waiting for all pending writes to complete before de-registering: grpnum 2

Fri Aug 31 09:51:28 2018

NOTE:Waiting for all pending writes to complete before de-registering: grpnum 2

Fri Aug 31 09:51:29 2018

 Received dirty detach msg from inst 1 for dom 2
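In a long alert log the PST wait warnings are easy to lose among the other messages. The following is a minimal sketch, not an Oracle tool, that pulls out each "Waited N secs for write IO to PST" line and reports the longest wait per disk. The names PST_WAIT and pst_wait_summary are hypothetical, and the pattern assumes the message format shown in the log above.

```python
import re

# Pattern for the warning text exactly as it appears in the alert log above.
PST_WAIT = re.compile(
    r"WARNING: Waited (\d+) secs for write IO to PST disk (\d+) in group (\d+)\."
)

def pst_wait_summary(log_text):
    """Map (group, disk) -> longest PST write wait in seconds found in log_text."""
    longest = {}
    for line in log_text.splitlines():
        match = PST_WAIT.search(line)
        if match:
            secs, disk, group = (int(x) for x in match.groups())
            key = (group, disk)
            longest[key] = max(longest.get(key, 0), secs)
    return longest

if __name__ == "__main__":
    sample = "\n".join([
        "WARNING: Waited 391 secs for write IO to PST disk 0 in group 2.",
        "Fri Aug 31 09:49:27 2018",
        "WARNING: Waited 120 secs for write IO to PST disk 0 in group 2.",
        "WARNING: Waited 120 secs for write IO to PST disk 1 in group 2.",
    ])
    for (group, disk), secs in sorted(pst_wait_summary(sample).items()):
        print(f"group {group} disk {disk}: waited up to {secs} secs")
```

Run against the excerpt above, this would show that both disk 0 and disk 1 of group 2 hit long waits, which matches the two disks ASM then tried to take offline.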

 

 

Below is another blogger's description of this failure:

 

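Recently I have received a string of reports of ASM disk groups being dismounted with the error "Waited 15 secs for write IO to PST". This comes from a heartbeat timeout check specific to ASM: the ASM instance periodically checks whether each ASM disk responds normally. I decided to write up a short summary of the issue.

The note ASM diskgroup dismount with "Waited 15 secs for write IO to PST" (Doc ID 1581684.1) contains the following passage: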
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Generally this kind messages comes in ASM alertlog file on below situations,

Delayed ASM PST heart beats on ASM disks in normal or high redundancy diskgroup,
thus the ASM instance dismount the diskgroup.By default, it is 15 seconds.

By the way the heart beat delays are sort of ignored for external redundancy diskgroup.
ASM instance stop issuing more PST heart beat until it succeeds PST revalidation,
but the heart beat delays do not dismount external redundancy diskgroup directly.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
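The passage above can be read as the following points:
1. The ASM instance periodically checks the state of the disks in every disk group to confirm that communication is normal;
2. The check leads to a dismount only for normal and high redundancy disk groups; for external redundancy, heartbeat delays do not dismount the disk group directly;
3. The default timeout is 15s: if the disks still have not responded to the ASM instance after 15s, the disk group is dismounted.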

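The customers who hit this problem were all using Fibre Channel network storage, and the error appeared when the storage network had problems. In other words, when ASM issues its periodic check and a disk does not respond within 15s, ASM considers the disk inaccessible.
I tried to reproduce the error in a test environment. Because the test environment was a VMware virtual machine, removing the disk at the physical level did not trigger it. The reason is that when a disk on the same host is removed abnormally, ASM I/O immediately gets an OS-level I/O error back, so there is no waiting for the "Waited 15 secs for write IO to PST" timeout.

So my conclusion is that this error only appears when the shared ASM disks are not local to the physical host but sit on a storage network, and the heartbeat ASM sends out is not acknowledged in time. The cause may lie in the storage array, the storage network, or even the disks themselves. Either way, ASM did not receive the acknowledgement it needs, so it treats the disk as faulty; and if enough disks are affected to threaten data integrity, ASM dismounts the disk group.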
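The reasoning above can be sketched as a small decision function. This is an illustration only, not Oracle's implementation: the function names are hypothetical, and the strict-majority rule is an assumption inferred from the log line "no read quorum in group: required 2, found 1 disks" for a group with three disks.

```python
def pst_quorum_ok(wait_secs, timeout_secs=15.0):
    """Return True if a strict majority of PST disks answered within the timeout.

    wait_secs maps a disk name to how long its heartbeat write has been pending.
    ASSUMPTION: quorum is modeled as a strict majority of the disks.
    """
    responsive = sum(1 for waited in wait_secs.values() if waited < timeout_secs)
    required = len(wait_secs) // 2 + 1
    return responsive >= required

def should_dismount(wait_secs, redundancy, timeout_secs=15.0):
    """Decide whether delayed heartbeats would dismount the disk group.

    Per the quoted note, heartbeat delays do not directly dismount an
    external redundancy disk group, so that case always returns False.
    """
    if redundancy == "external":
        return False
    return not pst_quorum_ok(wait_secs, timeout_secs)

if __name__ == "__main__":
    # Two of three disks stalled for 120s with a 120s timeout, as in the log.
    waits = {"OCR_VOTE_0000": 120, "OCR_VOTE_0001": 120, "OCR_VOTE_0002": 0.1}
    print(should_dismount(waits, "normal", timeout_secs=120))
    print(should_dismount(waits, "external", timeout_secs=120))
```

With two of three disks stalled, only one disk is responsive while two are required, so the normal redundancy case ends in a dismount; a single stalled disk would not.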

According to Doc ID 1581684.1, the "Waited 15 secs for write IO to PST" message appears from 11.2.0.3.0 onward. The note also describes how to change this timeout manually through the parameter _asm_hbeatiowait:

alter system set "_asm_hbeatiowait"=<value> scope=spfile sid='*';

<ASM/CRS must be restarted for the change to take effect.>

To confirm that this parameter first appears in 11.2.0.3, I queried every database version; the details are below:
======================10.2===================== 
SQL> select * from v$version; 
BANNER 
---------------------------------------------------------------- 
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - Prod 
PL/SQL Release 10.2.0.5.0 - Production 
CORE 10.2.0.5.0 Production 
TNS for Linux: Version 10.2.0.5.0 - Production 
NLSRTL Version 10.2.0.5.0 - Production 
  
SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm%' order by ksppinm; 
hidden parameter value 
-------------------------------------------------------------------------------- ---------- 
_asm_acd_chunks 1 
_asm_allow_only_raw_disks TRUE 
_asm_allow_resilver_corruption FALSE 
_asm_ausize 1048576 
_asm_blksize 4096 
_asm_direct_con_expire_time 120 
_asm_disk_repair_time 14400 
_asm_droptimeout 60 
_asm_emulmax 10000 
_asm_emultimeout 0 
_asm_fob_tac_frequency 3 
hidden parameter value 
-------------------------------------------------------------------------------- ---------- 
_asm_instlock_quota 0 
_asm_kfdpevent 0 
_asm_libraries ufs 
_asm_maxio 1048576 
_asm_skip_resize_check FALSE 
_asm_stripesize 131072 
_asm_stripewidth 8 
_asm_wait_time 18 
_asmlib_test 0 
_asmsid asm 
21 rows selected. 
  
======================11.2.0.1===================== 
sqlplus / as sysdba 
Connected to: 
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production 
With the Partitioning, OLAP, Data Mining and Real Application Testing options 
SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm; 
hidden parameter value 
-------------------------------------------------------------------------------- 
_asm_hbeatwaitquantum 2 
  
======================11.2.0.2===================== 
 $ sqlplus / as sysdba 
Connected to: 
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production 
With the Partitioning, Oracle Label Security, OLAP, Data Mining 
and Real Application Testing options 
SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm; 
hidden parameter value 
-------------------------------------------------------------------------------- 
_asm_hbeatwaitquantum 2 
  
The parameter only appears from 11.2.0.3 onward, which means the ASM instance's heartbeat I/O timeout check on disks was introduced in 11.2.0.3.
======================11.2.0.3=====================

sys@R11203> select * from v$version;

BANNER

--------------------------------------------------------------------------------

Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production

SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;

hidden parameter value

-------------------------------------------------- --------------------

_asm_hbeatiowait 15

_asm_hbeatwaitquantum 2

 

======================11.2.0.4=====================

SQL> select * from v$version;

BANNER

--------------------------------------------------------------------------------

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - Production

SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;

hidden parameter value

-------------------------------------------------------------------------------- ---------

_asm_hbeatiowait 15 <<<<<<<<<<<<<<<<<<<<

_asm_hbeatwaitquantum 2

 

 ======================12.1.0.1=====================

 $ sqlplus / as sysdba

Connected to:

Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production

With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;

hidden parameter value

--------------------------------------------------------------------------------

_asm_hbeatiowait 15

_asm_hbeatwaitquantum 2

 

Starting with 12.1.0.2, the default value of this parameter was raised to 120s

 ======================12.1.0.2=====================

 $ sqlplus / as sysdba

 

Connected to:

Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production

With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> select ksppinm as "hidden parameter", ksppstvl as "value" from x$ksppi join x$ksppcv using (indx) where ksppinm like '\_%' escape '\' and ksppinm like '%asm_hb%' order by ksppinm;

hidden parameter value

--------------------------------------------------------------------------------

_asm_hbeatiowait 120

_asm_hbeatwaitquantum 2
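The query results above can be condensed into a small lookup. This sketch only encodes the defaults observed in the sessions shown here; the function name is hypothetical, and treating every release from 12.1.0.2 upward as 120 is an assumption, since no later release was queried.

```python
def observed_default_hbeatiowait(version):
    """Return the default _asm_hbeatiowait (seconds) seen for a release string.

    Returns None for releases where the parameter was not listed.
    ASSUMPTION: releases newer than 12.1.0.2 keep the 120s default.
    """
    # Compare on the first four numeric components, e.g. "11.2.0.3.0" -> (11, 2, 0, 3).
    parts = tuple(int(p) for p in version.split("."))[:4]
    if parts < (11, 2, 0, 3):
        return None   # 10.2, 11.2.0.1, 11.2.0.2: parameter absent
    if parts < (12, 1, 0, 2):
        return 15     # 11.2.0.3, 11.2.0.4, 12.1.0.1
    return 120        # 12.1.0.2

if __name__ == "__main__":
    for release in ("10.2.0.5.0", "11.2.0.2.0", "11.2.0.3.0", "12.1.0.1.0", "12.1.0.2.0"):
        print(release, observed_default_hbeatiowait(release))
```

The value actually in effect on a given system should still be checked with the hidden-parameter query used above, because it may have been changed from the default.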

 

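I hope this summary is helpful. In day-to-day work I often find myself thinking that a problem is simple yet not being sure about it, so after testing I write it down for later reference.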

 

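For a case study of this problem, see another blog post: "Case analysis: a physical path failure on RAC shared disks made the ASM disk group holding OCR and Votedisk inaccessible"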

Reposted from blog.csdn.net/qq_34556414/article/details/82380480