Database Crashes With ORA-00494 (Doc ID 753290.1)



APPLIES TO:

Oracle Database Cloud Schema Service - Version N/A and later
Oracle Database Exadata Cloud Machine - Version N/A and later
Oracle Cloud Infrastructure - Database Service - Version N/A and later
Oracle Database Exadata Express Cloud Service - Version N/A and later
Oracle Database Backup Service - Version N/A and later
Information in this document applies to any platform.
 

SYMPTOMS

The database may crash with the following error reported in the alert file (shown here in the 11g XML format):

ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 20725'

This error can also be accompanied by ORA-600 [2103], which reports essentially the same condition: a process was unable to obtain the CF (control file) enqueue within the specified timeout (900 seconds by default).

This behavior is typically correlated with high server load, high concurrency on shared resources, and I/O waits and contention, which prevent the Oracle background processes from obtaining the resources they need.
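While the instance is still responsive, you can check which session holds or waits for the CF enqueue with a query along these lines (a minimal sketch, assuming standard V$LOCK/V$SESSION/V$PROCESS columns; the osid reported in the error corresponds to V$PROCESS.SPID):

-- Sketch: list holders (LMODE > 0) and waiters (REQUEST > 0) of the CF enqueue
SELECT s.sid, s.serial#, p.spid AS osid, s.program, l.lmode, l.request
FROM   v$lock l
       JOIN v$session s ON s.sid  = l.sid
       JOIN v$process p ON p.addr = s.paddr
WHERE  l.type = 'CF';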

Call stack
==============
[01]: kjdgpstackdmp []<-- Signaling
[02]: kjdglblkrdmpint []
[03]: ksikblkrdmp [KSI]
[04]: ksqgtlctx [VOS]
[05]: ksqgelctx [VOS]
[06]: kcc_get_enqueue []
[07]: kccocx []
[08]: kcc_begin_txn_internal []
[09]: krsaibcx []
[10]: kcrrcrl_dbc []
[11]: kcrrcrlc []
[12]: kcrrwkx []
[13]: kcrrwk []
[14]: ksbcti [background_proc]
[15]: ksbabs [background_proc]
[16]: ksbrdp [background_proc]
[17]: opirip []
[18]: opidrv []
[19]: sou2o []
[20]: main []
[21]: _start []

In the System State Dump, the following short stack can be found
=======================================
kaio<-aiowait<-skgfospo<-skgfrwat<-ksfdwtio

CAUSE

This could be due to any of the following:

Cause#1: The LGWR process has killed the CKPT process, causing the instance to crash.
From the alert.log we can see:

  • The database has waited too long for the CF enqueue, so the following error is reported:
    ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 38356'

  • Then LGWR killed the blocker, in this case the CKPT process, which caused the instance to crash.

Checking the alert.log further, we can see that the frequency of redo log file switches is very high (almost every minute).
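The switch frequency can also be confirmed from the database itself, for example with a query like the following (a minimal sketch based on V$LOG_HISTORY; the one-day window and hourly grouping are just an illustration):

-- Sketch: count redo log switches per hour over the last 24 hours
SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour,
       COUNT(*)                               AS log_switches
FROM   v$log_history
WHERE  first_time > SYSDATE - 1
GROUP  BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
ORDER  BY 1;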

Cause#2: Checking the I/O statistics in the AWR report, we find that the average read time (Av Rd(ms)) for the database files located on the mount point "/oracle/oa1l/data/" indicates an I/O problem, as shown by the collected data:

Tablespace           Av Rd(ms)
==================   =========
APPS_TS_TX_DATA         928.93
APPS_TS_TX_IDX          531.43
APPS_TS_SUMMARY          60.14
TEMP                     76.57
SYSTEM                   38.08
APPS_UNDOTS1            103.75
SYSAUX                   61.43
APPS_TS_INTERFACE        38.50
APPS_TS_QUEUES           72.60
APPS_TS_ARCHIVE          49.90
NOETIX_TS                83.39
APPS_TS_SEED            123.20
TOOLS                    30.00
APPS_TS_NOLOGGING        50.00
As per Doc ID 1275596.1 "How to Tell if the IO of the Database is Slow", a typical multi-block synchronous read of 64 x 8k blocks (512 KB total) should average at most 20 milliseconds before 'slow IO' becomes a concern; smaller requests should be faster (10-20 ms), whereas for larger requests the elapsed time should be no more than 25 ms. Here, the average read times for all of the tablespaces above are greater than 20 ms.
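Outside of AWR, a rough per-datafile average read time can also be computed from V$FILESTAT (a minimal sketch; READTIM is recorded in centiseconds and requires timed statistics, so it is multiplied by 10 to express milliseconds):

-- Sketch: average read time per datafile since instance startup
SELECT d.name,
       f.phyrds,
       ROUND(f.readtim * 10 / NULLIF(f.phyrds, 0), 2) AS avg_read_ms
FROM   v$filestat f
       JOIN v$datafile d ON d.file# = f.file#
ORDER  BY avg_read_ms DESC;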

Cause#3: The problem has been investigated in Bug 7692631 - 'DATABASE CRASHES WITH ORA-494 AFTER UPGRADE TO 10.2.0.4'
and unpublished Bug 7914003 'KILL BLOCKER AFTER ORA-494 LEADS TO FATAL BG PROCESS BEING KILLED'

For HP platforms: HP analyzed this and reported that it is due to memory wastage caused by Oracle setting max_concurrent for ASYNC ports to 4096 and 5000.

This is not a memory leak; the Oracle application must be stopped to free the objects. The object for index 23 is used for buffer headers for async I/O requests. The number of buffer headers created depends on the max_concurrent value passed when calling ioctl() for async_config. If a large value is passed as max_concurrent, the number of objects for index 23 increases. This is an application coding issue.

Bug 8965438 causes the async I/O setup for Oracle to use too much memory, so the node goes into a hang, resulting in a database hang as well. The bug is fixed in 11.2.0.2.

In this case, apply Patch 8965438 if it is available for your platform.

SOLUTION

Solution#1:
We usually suggest configuring the redo logs so that log switches occur every 20-30 minutes, to reduce contention on the control files.
You can use the V$INSTANCE_RECOVERY view column OPTIMAL_LOGFILE_SIZE to determine a recommended size for your online redo logs. This field shows the redo log file size in megabytes that is considered optimal based on the current setting of  FAST_START_MTTR_TARGET. If this field consistently shows a value greater than the size of your smallest online log, then you should configure all your online logs to be at least this size.
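For example, the recommendation can be compared against the current online redo log sizes with queries like these (a minimal sketch using V$INSTANCE_RECOVERY and V$LOG):

-- Sketch: MTTR-based recommended redo log size, in MB
SELECT optimal_logfile_size AS optimal_mb
FROM   v$instance_recovery;

-- Sketch: current online redo log group sizes, in MB
SELECT group#, bytes / 1024 / 1024 AS size_mb
FROM   v$log;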

Solution#2:
Check the storage on which the database resides, as the collected data indicates this is an I/O issue.

Solution#3:

This kill blocker interface / ORA-494 was introduced in 10.2.0.4. This new mechanism will kill *any* kind of blocking process, non-background or background.

  • The difference is that if the enqueue holder is a non-background process, the instance can continue to function even if the holder is killed.
  • If the holder is a background process, for example LGWR, killing the holder leads to an instance crash.

If you want to prevent the blocker (whether background or non-background) from being killed, you can set

_kill_controlfile_enqueue_blocker=false.


This means that no type of blocker will be killed anymore, although the resolution of this problem should focus on why the process is holding the enqueue for so long. Alternatively, you may prefer to only avoid killing background processes, since they are vital to the instance, while still allowing the killing of non-background blockers.
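If, after weighing the trade-off above, you decide to set this parameter, it could be done along these lines (a minimal sketch; underscore parameters must be quoted, should only be changed in agreement with Oracle Support, and the SCOPE/restart choice shown here is an assumption):

-- Sketch: disable the kill-blocker behavior for the controlfile enqueue
ALTER SYSTEM SET "_kill_controlfile_enqueue_blocker" = FALSE SCOPE = SPFILE;
-- Restart the instance for the SPFILE change to take effect.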

This has been addressed in a secondary bug - unpublished Bug 7914003 'KILL BLOCKER AFTER ORA-494 LEADS TO FATAL BG PROCESS BEING KILLED' which was closed as Not a bug.
 

In order to prevent a background blocker from being killed, you can set the following init.ora parameter to 1 (the default value is 3 for 10g and 2 for 11g databases).

_kill_enqueue_blocker=1


With this setting, if the enqueue holder is a background process it will not be killed, and therefore the instance will not crash. If the enqueue holder is not a background process, the new 10.2.0.4 mechanism will still try to kill it.

The reason the kill-blocker interface with ORA-494 is kept is that, in most cases, customers would rather crash the instance than suffer a cluster-wide hang.

_kill_enqueue_blocker = { 0 | 1 | 2 | 3 }

    0. Disables the mechanism; no foreground or background blocker process holding the enqueue will be killed.
    1. Enables the mechanism and kills only foreground blocker processes holding the enqueue; background processes are not affected.
    2. Default value in 11g. Enables the mechanism and kills only background blocker processes holding the enqueue.
    3. Default value in 10g. Enables the mechanism and kills any blocker process holding the enqueue.
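To keep only foreground blockers eligible for the kill, as described above, the parameter could be set like this (a minimal sketch; again an underscore parameter, so change it only under Oracle Support guidance, and the SCOPE shown is an assumption):

-- Sketch: keep the kill-blocker mechanism for foreground processes only
ALTER SYSTEM SET "_kill_enqueue_blocker" = 1 SCOPE = SPFILE;
-- Restart the instance for the new value to take effect.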

REFERENCES

NOTE:1915481.1 - Troubleshooting ORA-494 Errors
