【Troubleshooting】ADG instance unexpectedly terminated by the LGWR process: what happened?

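【Preface】
An Active Data Guard (ADG) instance crashed without warning. The alert log and its trace file show that the LGWR process terminated the instance.
What caused it, what exactly happened, and how can it be fixed?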

The relevant alert log entries are as follows:
【Error messages】

ORA-04021: timeout occurred while waiting to lock object 
LGWR (ospid: 27954): terminating the instance due to error 4021
Wed Sep 11 02:28:54 2019
opiodr aborting process unknown ospid (28247) as a result of ORA-1092
Wed Sep 11 02:28:54 2019
ORA-1092 : opitsk aborting process
Wed Sep 11 02:28:54 2019
System state dump requested by (instance=1, osid=27954 (LGWR)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/diag/diag/rdbms/orcl/trace/ORCL_diag_27921_20190911022854.trc
Instance terminated by LGWR, pid = 27954
error 4021 detected in background process

As shown above, ORA-04021: timeout occurred while waiting to lock object appears first, and LGWR terminates the instance immediately afterwards. Next, look at the trace file /oracle/diag/diag/rdbms/orcl/trace/ORCL_diag_27921_20190911022854.trc.

Trace file /oracle/diag/diag/rdbms/orcl/trace/ORCL_diag_27921_20190911022854.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /opt/oracle/product/11.2.0.4/db_1
System name:    HP-UX
Node name:      ethan-db
Release:        B.11.31
Version:        U
Machine:        ia64
Instance name: ORCL
Redo thread mounted by this instance: 1
Oracle process number: 16
Unix process pid: 27954, image: oracle@ethan-db (LGWR)

*** 2019-08-22 15:14:45.875
*** SESSION ID:(1162.1) 2019-08-22 15:14:45.875
*** CLIENT ID:() 2019-08-22 15:14:45.875
*** SERVICE NAME:() 2019-08-22 15:14:45.875
*** MODULE NAME:() 2019-08-22 15:14:45.875
*** ACTION NAME:() 2019-08-22 15:14:45.875
 
DDE rules only execution for: ORA 312
----- START Event Driven Actions Dump ----
---- END Event Driven Actions Dump ----
----- START DDE Actions Dump -----
Executing SYNC actions
----- START DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (Async) -----
Successfully dispatched
----- END DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (SUCCESS, 0 csec) -----
Executing ASYNC actions
----- END DDE Actions Dump (total 0 csec) -----
DDE rules only execution for: ORA 313
----- START Event Driven Actions Dump ----
---- END Event Driven Actions Dump ----
----- START DDE Actions Dump -----
Executing SYNC actions
----- START DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (Async) -----
DDE Action 'DB_STRUCTURE_INTEGRITY_CHECK' was flood controlled
----- END DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (FLOOD CONTROLLED, 0 csec) -----
Executing ASYNC actions
----- END DDE Actions Dump (total 0 csec) -----

*** 2019-09-11 02:28:54.002
ORA-04021: timeout occurred while waiting to lock object 
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+592<-kjzdssdmp()+720<-kjzduptcctx()+512<-kjzdicrshnfy()+160<-$cold_ksuitm()+5808<-$cold_ksbrdp()+2768<-opirip()+1312<-opidrv()+1152<-sou2o()+256<-
opimai_real()+352<-ssthrdmain()+608<-main()+336<-main_opd_entry()+80 
----- End of Abridged Call Stack Trace -----
*** 2019-09-11 02:28:54.017
LGWR (ospid: 27954): terminating the instance due to error 4021
ksuitm: waiting up to [5] seconds before killing DIAG(27921)

The trace file offers limited information as well: essentially just a call stack.

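Searching My Oracle Support (MOS) for ORA-04021: timeout occurred while waiting to lock object turns up the document "ORA-04021: timeout occurred while waiting to lock object : DR Instance terminated by LGWR" (Doc ID 2183882.1), which matches the problem seen here. There are two reasons for this judgment: first, the error messages in the alert log are the same; second, the trace content also matches. According to Doc ID 2183882.1, the cause should be Bug 16717701 - ADG SHOULD GET THE INSTANCE PARSE LOCK WITH A TIMEOUT or Bug 11712267 - ACTIVE DATA GUARD DATABASE HUNG ON 'LIBRARY CACHE: MUTEX X' WAIT EVENT.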

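Reviewing Doc ID 2183882.1:

Cause analysis:
While ADG is applying redo during recovery, LGWR locks the instance state object in exclusive mode. As a result, LGWR can block SQL parsing, and SQL parsing can in turn block LGWR.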

SYS@CRMPRDSDB> show parameter cursor_sharing;

NAME                                 TYPE
------------------------------------ ---------------------------------
VALUE
------------------------------
cursor_sharing                       string
EXACT

This behavior can be observed by querying the views v$sesstat and v$statname:

SYS@ORCL> select a.*,b.name from v$sesstat a , v$statname b
  2  where a.statistic#=b.statistic# 
  3  and a.sid=(select distinct sid from v$mystat)
  4  and b.name like '%parse%';
       SID STATISTIC#      VALUE NAME
---------- ---------- ---------- ------------------------------
      1743        264          0 ADG parselock X get attempts
      1743        265          0 ADG parselock X get successes
      1743        622          4 parse time cpu
      1743        623          5 parse time elapsed
      1743        624         20 parse count (total)
      1743        625          7 parse count (hard)
      1743        626          0 parse count (failures)
      1743        627          0 parse count (describe)
8 rows selected.
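
The query above only covers the current session. As a complementary sketch, the same two ADG counters can be read instance-wide from v$sysstat. The statistic names are taken from the output above; running this as a privileged user on the ADG instance is an assumption.

```sql
-- Instance-wide counters for LGWR's exclusive (X) parse-lock requests.
-- Assumption: connected as a privileged user (for example SYS) on the ADG instance.
select name, value
  from v$sysstat
 where name like 'ADG parselock%'
 order by name;
```

Sampling this periodically and comparing "get attempts" with "get successes" gives a rough picture of how often LGWR has to retry the lock request.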

To prevent the problem from recurring, MOS provides the following solution:
SOLUTION
The issue matches Bug 11712267 and Bug 16717701.
Since both bugs match this case, you can try option (1) first.
Option (1):
As per Bug 11712267, change cursor_sharing to FORCE on Active Data Guard (ADG).
Monitor your environment for some time.
If it crashes again, follow option (2).
Option (2):
As per the bug description, LGWR can request the DBINSTANCE lock in X mode without any timeout, which can lead to a hang or deadlock.
Both fixes are already included in 11.2.0.4, but the fix is DISABLED by default.
To ENABLE the fix, set "_adg_parselock_timeout" to the number of centiseconds LGWR should wait before backing off and retrying the request.
The value should be in centiseconds. I don't think there is really any hard and fast rule for a value; at the default (0) it will not time out.
A value representing a few seconds seems reasonable: if LGWR has been stuck waiting for, say, 5 seconds, it seems a reasonable guess that it is not going to get the lock.
The parameter just causes it to abort the current attempt and retry. If you want to play safe, you can start with a higher value and decrease it later.
A higher value will just mean more sessions blocked for longer in case of the deadlock situation.
500 seems reasonable, but I have no data to base it on.
There should be a statistic "ADG parselock X get attempts". If the parameter is set too small, that value would likely increase a lot because requests keep timing out and retrying.
This is a dynamic parameter.
Follow option (1): change cursor_sharing to FORCE on ADG.
If the issue reappears, follow option (2) as below.
Please set "_adg_parselock_timeout" to 500:
SQL> alter system set "_adg_parselock_timeout"=500 scope=both sid='*';

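In short:
1. Change cursor_sharing to FORCE to reduce the amount of SQL hard parsing.
2. If the problem occurs again, set the hidden parameter "_adg_parselock_timeout" to 500 (centiseconds, that is, 5 seconds). With this setting LGWR gives up a lock request that has waited this long and retries it, instead of waiting indefinitely. The parameter can be changed dynamically.
SQL> alter system set "_adg_parselock_timeout"=500 scope=both sid='*';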
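
The two options can be sketched in SQL*Plus as follows. This is only an illustration under stated assumptions: the instance uses an spfile (so scope=both works), the session is connected as SYS, and the hidden parameter is read through the x$ksppi and x$ksppcv fixed tables, which are normally visible only to SYS. Hidden parameters should be changed only on the advice of Oracle Support.

```sql
-- Option (1): switch cursor_sharing to FORCE on the ADG instance.
-- Assumption: an spfile is in use; scope and sid mirror the MOS statement.
alter system set cursor_sharing=FORCE scope=both sid='*';

-- Confirm the new value (SQL*Plus command).
show parameter cursor_sharing

-- Option (2) check: read the current value of the hidden parameter.
-- Assumption: connected as SYS, because x$ fixed tables are not exposed to other users by default.
select i.ksppinm  as name,
       v.ksppstvl as value
  from x$ksppi i, x$ksppcv v
 where i.indx = v.indx
   and i.ksppinm = '_adg_parselock_timeout';
```

Checking the value before and after the change makes it easy to confirm that the setting actually took effect on the standby.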

【References】
https://blog.51cto.com/lyzbg/2090812
https://blog.csdn.net/vic_qxz/article/details/88296722
ORA-04021: timeout occurred while waiting to lock object : DR Instance terminated by LGWR (Doc ID 2183882.1)

db_murphy
