ORA-00600: internal error code, arguments: [504], [0x38006F868], [160]: A Detailed Explanation (Original)

Operating system: Solaris. Database environment: Oracle 9i 9.2.0.8

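I don't know whether Oracle simply has too many ORA-600 bugs or whether I have just been lucky lately, but I keep running into them. Last Friday, right after I finished fixing a customer database that would not start normally, an ORA-600 appeared. The full error was: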

ORA-00600: internal error code, arguments: [504], [0x38006F868], [160], [7], [shared pool], [2], [0], [0x38006F778]
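
When the same error shows up several times in an alert log, it helps to pull the bracketed arguments out of each line so they can be compared. The following is a minimal Python sketch of that idea. The function name parse_ora600 is hypothetical, and the script only splits the arguments; it does not interpret what each position means.

```python
import re

# Matches each bracketed argument that follows "arguments:" in the error line.
_ARG_RE = re.compile(r"\[([^\]]*)\]")

def parse_ora600(line):
    """Return the bracketed arguments as a list, or None if the line is not an ORA-600."""
    if "ORA-00600" not in line and "ORA-600" not in line:
        return None
    _, sep, tail = line.partition("arguments:")
    if not sep:
        return None
    return _ARG_RE.findall(tail)

line = ("ORA-00600: internal error code, arguments: [504], [0x38006F868], "
        "[160], [7], [shared pool], [2], [0], [0x38006F778]")
args = parse_ora600(line)
# Print the error number, the latch name, and the argument after it.
print(args[0], args[4], args[5])
```

Running it over every ORA-600 line in the log makes it easy to see which arguments stay constant (here 504 and "shared pool") and which vary between occurrences.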

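At first I assumed this error only occurred while the database was starting up or shutting down. The next day, however, the customer called to say that the application could not connect. From the logs I inferred that the bug above had caused a memory overflow, which prevented JBoss from connecting to the database.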

I found the following material online.
Source 1

Related: information found on Metalink
Bug 5888835: ORA-600 [504] DURING FLUSH SHARED_POOL 
--------------------------------------------------------------------------------
Bug Attributes
--------------------------------------------------------------------------------
Type              B - Defect                      Fixed in Product Version  -
Severity          2 - Severe Loss of Service      Product Version           9.2.0.8
Status            36 - Duplicate Bug. To Filer    Platform                  59 - HP-UX PA-RISC (64-bit)
Created           17-Feb-2007                     Platform Version          11.11
Updated           18-Feb-2007                     Base Bug                  5508574
Database Version  9.2.0.8
Affects Platforms Generic
Product Source    Oracle
Related Products
--------------------------------------------------------------------------------
Product Line      Oracle Database Products        Family                    Oracle Database
Area              Oracle Database                 Product                   5 - Oracle Server - Enterprise Edition
Hdr: 5888835 9.2.0.8 RDBMS 9.2.0.8 VOS HEAP MGMT PRODID-5 PORTID-59 ORA-600 5508574
Abstract: ORA-600 [504] DURING FLUSH SHARED_POOL
*** 02/17/07 03:49 am ***
TAR:
----
PROBLEM:
--------

Ct got the following error when upgrading 8.1.7.4 to 9.2.0.8. This seems
to occur when flushing shared_pool on the upgrade script.

ORA-600: internal error code, arguments: [504], [0xC00000038E5B2530],
[640], [7], [shared pool], [2], [0], [0xC00000038E5B2418]
This can also be reproduced by manually flushing shared pool.
The customer's system has 64 CPUs. The ORA-600 [504] was suppressed
when _kgl_latch_count=30 was set.
DIAGNOSTIC ANALYSIS:
--------------------
The heapdump level 2 information when ORA-600[504] occurred says the following.
The number of next slot reaches to 255.It seems to be the same issue as bug 5562921 (base bug 5508574).
Thanks.
====
/home/oracle/udump/dz00001_ora_22436.trc
Oracle9i Enterprise Edition Release 9.2.0.8.0 - 64bit Production
With the Partitioning option
JServer Release 9.2.0.8.0 - Production
ORACLE_HOME = /opt/oracle/product/9.2.0
System name:   HP-UX
Node name:     DC-0-001
Release:       B.11.11
Version:       U
Machine:       9000/800
Instance name: DZ00001
Redo thread mounted by this instance: 1
Oracle process number: 17
Unix process pid: 22436, image: oracle@DC-0-001 (TNS V1-V3)
*** 16:17:31.085
*** ID:(20.280) 2007-02-17 16:17:31.083
KGH Latch Directory Information
ldir state: 2  next slot: 255
Slot [  1] Latch: c0000001a654c5c0  Index: 2  Flags:  3  State: 2  next: 0000000000000000
Slot [  2] Latch: c000000182a04468  Index: 1  Flags:  3  State: 2  next: 0000000000000000
Slot [  3] Latch: c0000001a654dcb0  Index: 2  Flags:  3  State: 2  next: 0000000000000000
Slot [  4] Latch: c000000182a047e8  Index: 1  Flags:  3  State: 2  next: 0000000000000000
Slot [  5] Latch: c000000182a04940  Index: 2  Flags:  3  State: 2  next: 0000000000000000
Slot [  6] Latch: c0000001a654e340  Index: 1  Flags:  3  State: 2  next: 0000000000000000
Slot [  7] Latch: c0000001a6553528  Index: 2  Flags:  3  State: 2  next: 0000000000000000
Slot [  8] Latch: c0000001a65557f8  Index: 1  Flags:  3  State: 2  next: 0000000000000000
Slot [  9] Latch: c0000001a6556988  Index: 2  Flags:  3  State: 2  next: c0000001a65aff98
Slot [ 10] Latch: c0000001a6558328  Index: 1  Flags:  3  State: 2  next: c0000001a65affb0
WORKAROUND:
-----------
Setting _kgl_latch_count=30
RELATED BUGS:
-------------
bug 5562921
bug 5508574
REPRODUCIBILITY:
----------------
It can be reproduced by manually flushing shared pool.
(without _kgl_latch_count=30)
TEST CASE:
----------
none
STACK TRACE:
------------
ksedmp ksddoa ksdpcg ksdpec ksfpec kgeriv kgesiv ksesic7 kslgetl ksfglt
kghupr_flg kghupr kglrfcl kglobcl kglobfr kglobf0 kglhpd_internal kglhpd
kghfrx kghfrunp kghfsh_helper kghfsh kkyasy opiexe opiall0 kpoal8 opiodr
ttcpip opitsk opiino opiodr opidrv sou2o main
SUPPORTING INFORMATION:
-----------------------
I will trace files later.
24 HOUR CONTACT INFORMATION FOR P1 BUGS:
----------------------------------------
DIAL-IN INFORMATION:
--------------------
IMPACT DATE:
------------
*** 02/17/07 03:58 am *** (CHG: Sta->16)
*** 02/17/07 03:58 am ***
*** 02/17/07 04:40 am *** (CHG: Asg->NEW OWNER OWNER)
*** 02/17/07 05:45 am ***
Well, this seems to be same issue with bug#5562921.
heapdump level 2 shows
-------
*** ID:(20.280) 2007-02-17 16:17:31.083
KGH Latch Directory Information
ldir state: 2  next slot: 255
         :
Slot [253] Latch: c000000190c02cb0  Index: 2  Flags:  3  State: 2  next:
0000
01a65b1420
Slot [254] Latch: c000000190c02b98  Index: 1  Flags:  3  State: 2  next:
0000
01a65b1438
This means we have registered 255 latches to kgh latch directory, and it is full. (See bug#5508574 update by Joan at 09/08/06 12:24 pm, there is small bug in this print routine and 255 is not next slot, but last slot)
ORA-600 means latch hierarchy violation. We tried to get 2nd child shared pool latch with wait get, that is not permitted.
This happened when freeing library cache object from shared pool, we need to get correct library cache latch to do it. However, probably due to kgl latch is not registered correctly to kgh latch directory, we end up with requesting latch with wrong order, or strange latch. kgl latch is not registered correctly because kgh latch directory is full. It has 255 slots, and it seems full. We register many latches to kgh latch directory, however, if kgl latch number is big, we will fill kgh latch directory and some latch is not registered correctly.
*** 02/17/07 05:47 am ***
In my 9.2.0.8 env (with 2 cpu) we register 208 latches.
number of latches which will be allocated to kgh latch directory is not calculated easily. because we register many kind of latches.
heapdump level 2 will print latch address registered to kgh, so you can find the name of latch via v$latch, v$latch_children and v$latchname.
To see if an instance gets this problem, taking heapdump level 2 and check "next slot:" value. If it is 255, most likely that instance gets
this problem.
*** 02/17/07 05:48 am ***
*** 02/17/07 06:09 am ***
*** 02/17/07 06:35 am ***
*** 02/18/07 04:14 pm *** (CHG: Sta->36 SubComp->VOS HEAP MGMT)
*** 02/18/07 06:51 pm ***
From 9.2.0.8, we allocate many row cache latches compared to older PSRs. This is due to the enhancement introduced in 5040691. So, disabling this fix is another workaround. If I set _more_rowcache_latches = false in init.ora, I see (only) 238 latches are registered to kgh latch directory even if I set _kgl_latch_count=67 _kghdsidx_count=3.
If I don't set _more_rowcache_latches = false, I can see ORA-600 simply by starting up and shutting down the database.
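
The bug note above suggests taking a level 2 heapdump and checking the "next slot:" value of the KGH latch directory. The following Python sketch automates that check on the text of a trace file. The function names are hypothetical, and the line format is assumed to match the excerpt quoted above.

```python
import re

# Matches the summary line of a "KGH Latch Directory Information" dump.
NEXT_SLOT_RE = re.compile(r"ldir state:\s*\d+\s+next slot:\s*(\d+)")

def next_slot_values(trace_text):
    """Return every 'next slot' value found in the trace text."""
    return [int(m.group(1)) for m in NEXT_SLOT_RE.finditer(trace_text)]

def directory_looks_full(trace_text, limit=255):
    """True if any dump reports the slot counter at the 255-slot limit."""
    return any(value >= limit for value in next_slot_values(trace_text))

sample = """KGH Latch Directory Information
ldir state: 2  next slot: 255
Slot [  1] Latch: c0000001a654c5c0  Index: 2  Flags:  3  State: 2  next: 0000000000000000
"""
print(directory_looks_full(sample))
```

In practice the trace text would be read from the file under udump. A value of 255 is only an indicator, as the note itself says "most likely".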

Source 2
Symptoms
Mon Nov 22 08:50:45 2010
Errors in file /home/oracle/admin/orcl/udump/orcl_ora_127.trc:
ORA-00600: internal error code, arguments: [504], [0x380068D90], [160], [7], [shared pool], [2], [0], [0x380068CA0]
Mon Nov 22 08:51:03 2010
Errors in file /home/oracle/admin/orcl/udump/orcl_ora_444.trc:
ORA-00600: internal error code, arguments: [504], [0x380068E80], [160], [7], [shared pool], [3], [0], [0x380068D90]
The following query returns _kgl_latch_count > 31
SQL>
select a.ksppinm aa, b.ksppstvl bb
from x$ksppi a, x$ksppsv b
where a.indx=b.indx
and a.ksppinm='_kgl_latch_count';

OR
the query may return 0 for _kgl_latch_count, but 'show parameter cpu_count' returns a value of 32 or greater.
SQL> show parameter cpu_count;
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cpu_count                            integer     128

Cause

This is Bug 5888835 ORA-600 [504] DURING FLUSH SHARED_POOL
closed as a duplicate of
Base Bug 5508574  ORA-00600 [99999] , ORA-07445 [KGSCDUMP()+673]
The latch directory size exceeds 255 when _kgl_latch_count > 31.
However, even when the _kgl_latch_count is equal to 0 (default value), if the cpu_count is >=32 the bug still applies.
This is because the default value of _kgl_latch_count is calculated as the next prime number after the value returned by cpu_count. So the bug can still apply when cpu_count=32, because _kgl_latch_count would then be calculated as the next prime number, 37.
Solution
1.  Upgrade to the 10.2.0.4 patchset or the 11g release.
OR
2. You can use workaround of setting parameter _kgl_latch_count <= 31 in your pfile/spfile.
    (please note that this may adversely affect the concurrency)
OR
3.   If available for your platform/version, download and apply Patch 5508574
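
The Symptoms and Cause sections above boil down to a simple rule, which can be sketched in Python as follows. The function name is hypothetical, and the rule is taken from the note as written; it is not a substitute for checking the heapdump.

```python
def exposed_to_bug(kgl_latch_count, cpu_count):
    """Apply the rule from the note.

    kgl_latch_count: value reported for _kgl_latch_count (0 means default).
    cpu_count: value reported by 'show parameter cpu_count'.
    """
    if kgl_latch_count == 0:
        # Default: the value is derived from cpu_count, so 32 or more CPUs is a risk.
        return cpu_count >= 32
    # Explicit setting: anything above 31 is a risk.
    return kgl_latch_count > 31

print(exposed_to_bug(0, 128))   # the cpu_count=128 case shown above
print(exposed_to_bug(30, 64))   # the workaround value on a 64-CPU machine
```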
Source 3

ORA-00600 error lookup URL: http://metalink.oracle.com/metalink/plsql/ml2_documents.showNOT?p_id=153788.1&p_showHeader=1&p_showHelp=0
Error description: ORA-00600: internal error code, arguments: [504], [0x38006BF08], [160], [7], [shared pool], [5], [0], [0x38006BE18]
Description
An ORA-600 [504] can occur on the "shared pool" latch while
freeing a kglf heap. kglfall() and kghfrunp() will be in the call stack trace.
Workaround:
Set _kghdsidx_count=1 to disable multiple shared pool subpools
I decided to change _kghdsidx_count.
select a.ksppinm,b.ksppstvl from x$ksppi a,x$ksppsv b where a.indx=b.indx and a.ksppinm='_kghdsidx_count';
KSPPINM            KSPPSTVL
---------         ---------
_kghdsidx_count     7
alter system set "_kghdsidx_count"=1 scope=spfile;
Then restart the database. Whether this actually helps remains to be verified.

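The conclusion is essentially that this is a bug in how Oracle 9.2.0.8 handles latches. The possible fixes are:
1. Upgrade Oracle to 10.2.0.4 or 11g.
2. Apply Patch 5508574 to Oracle.
3. Reduce _kghdsidx_count (every source sets _kghdsidx_count to 1, but based on my understanding of the parameter, I think lowering it is enough and it does not have to go all the way to 1), and set _kgl_latch_count to a value less than or equal to 31, for example 30.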

The two parameters _kghdsidx_count and _kgl_latch_count are described below.
_kgl_latch_count

It sets the number of child library cache latches. The default is the least prime number greater than or equal to cpu_count. The maximum is 67. It can safely be increased to combat library cache latch contention, as long as you stick to prime numbers. However it is only effective if the activity across the existing child library cache latches is evenly distributed as shown in V$LATCH_CHILDREN.
Adjusting _kgl_latch_count is normally effective in reducing library cache latch contention. But stick to prime numbers less than or equal to 67, and no larger than necessary.

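Oracle provides multiple library cache latches (each one is therefore called a child latch) to protect the buckets in the library cache. The number of child latches is determined by a hidden parameter, _kgl_latch_count. Its default is the smallest prime number greater than or equal to the number of CPUs in the system. For example, in a production environment with 4 CPUs, there are 5 library cache latches. However, Oracle (in 9i) internally caps the number of library cache latches at 67: even if the hidden parameter is set to 100, there are still only 67 library cache latches.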
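
The default described above can be reproduced with a few lines of Python. This is only a sketch of the rule as stated in this article (smallest prime greater than or equal to cpu_count, capped at 67); the function names are hypothetical and the real internal calculation is not documented here.

```python
def is_prime(n):
    """Trial-division primality test, sufficient for small CPU counts."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def default_kgl_latch_count(cpu_count, maximum=67):
    """Smallest prime >= cpu_count, capped at the 9i maximum of 67."""
    n = max(cpu_count, 2)
    while not is_prime(n):
        n += 1
    return min(n, maximum)

for cpus in (4, 32, 128):
    print(cpus, default_kgl_latch_count(cpus))
```

For 4 CPUs this gives 5, matching the example in the text. For 32 CPUs it gives 37, which is above the 31 threshold mentioned in Source 2.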

Note: a query for _kgl_latch_count sometimes shows 0. This is a bug.

_kghdsidx_count

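Before Oracle 9 (roughly 9.2.0.4), a very large shared pool could cause performance problems because the shared pool free list or used list grew too long. From about 9.2.0.4 onward, Oracle introduced shared pool subpools: a large shared pool (more than about 400 MB) is split into several subpools, each with its own free list and used list, so a large shared pool no longer causes these problems. The number of subpools is normally left at the default, which Oracle derives from the shared pool size, but it can also be set with _kghdsidx_count.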

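In Oracle 9i, to better support large shared pools, the number of shared pool latches was raised from 1 to 7. If the system has 4 or more CPUs and shared_pool_size is larger than 250 MB, Oracle can split the shared pool into multiple subpools, each with its own structures, LRU, and shared pool latch. The following query shows these latches: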
SQL> select addr, name, gets, misses, spin_gets  from v$latch_children where name = 'shared pool';
select * from x$ksppi a,x$ksppsv b where a.indx=b.indx and a.ksppinm='_kgl_latch_count'


References:
           http://www.ixora.com.au/q+a/library.htm
           http://www.itpub.net/thread-913955-1-1.html
           http://space.itpub.net/464838/viewspace-588908
           http://tech.it168.com/oldarticle/2007-07-04/200707041137531_3.shtml
           http://www.linuxidc.com/Linux/2011-07/39425p3.htm
           http://cuuzhang.blog.163.com/blog/static/608115292008624101721121/
           http://hi.baidu.com/lichangzai/blog/item/4d26c6fd0943770208244d53.html
           http://blog.sina.com.cn/s/blog_4d22b9720100n4vu.html
           http://zhangxu777.blog.163.com/blog/static/146290921200961655213/

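This article is original. If you repost it, please credit the source and the author.

If you find any errors, corrections are welcome.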


Reposted from czmmiao.iteye.com/blog/1214001