ORACLE RAC 11.2.0.4 FOR SOLARIS: ASM and DB restart due to cluster heartbeat loss. The root cause is Bug 10194190 / 18740837,
and patch 25142535 is required to fix it. In a customer's Oracle 11.2.0.4 RAC on Solaris, the DB homes on all nodes had been patched, yet
after the host had been up for 248 days, the instances still restarted due to loss of the cluster heartbeat. According to the official reply from Oracle MOS,
one of the fixes, patch 18740837, must also be applied to the GI software.
Below is a summary of the analysis and resolution of this customer case.
1. Error messages in the DB alert log of the problem node
2. Error messages in the ASM alert log of the problem node
3. Content of the +ASM2_lmhb_4609_i78497.trc file
4. The OCSSD log reports a heartbeat timeout
5. Partial content of the +ASM2_lmhb_4609_i78497.trc file
===[ Session State Object ]===
----------------------------------------
SO: 0x3ffdb38d8, type: 4, owner: 0x400b0c258, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x400b0c258, name=session, file=ksu.h LINE:12729 ID:, pg=0
(session) sid: 145 ser: 1 trans: 0x0, creator: 0x400b0c258
flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
flags2: (0x409) -/-/INC
DID: , short-term DID:
txn branch: 0x0
edition#: 0 oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYS
ksuxds FALSE at location: 0
service name: SYS$BACKGROUND
Current Wait Stack:
0: waiting for 'rdbms ipc message'
timeout=0xa, =0x0, =0x0
wait_id=432476951 seq_num=11521 snap_id=1
wait times: snap=2 min 46 sec, exc=2 min 46 sec, total=2 min 46 sec
wait times: max=0.100000 sec, heur=2 min 46 sec
wait counts: calls=1 os=1
in_wait=1 iflags=0x5a8
Wait State:
fixed_waits=0 flags=0x22 boundary=0x0/-1
Session Wait History:
elapsed time of 0.000015 sec since current wait
0: waited for 'CGS wait for IPC msg'
=0x0, =0x0, =0x0
wait_id=432476950 seq_num=11520 snap_id=1
wait times: snap=0.000027 sec, exc=0.000027 sec, total=0.000027 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000138 sec of elapsed time
1: waited for 'rdbms ipc message'
timeout=0xa, =0x0, =0x0
wait_id=432476949 seq_num=11519 snap_id=1
wait times: snap=0.102094 sec, exc=0.102094 sec, total=0.102094 sec
wait times: max=0.100000 sec
wait counts: calls=1 os=1
occurred after 0.000015 sec of elapsed time
2: waited for 'CGS wait for IPC msg'
=0x0, =0x0, =0x0
wait_id=432476948 seq_num=11518 snap_id=1
wait times: snap=0.000022 sec, exc=0.000022 sec, total=0.000022 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000133 sec of elapsed time
3: waited for 'rdbms ipc message'
timeout=0xa, =0x0, =0x0
wait_id=432476947 seq_num=11517 snap_id=1
wait times: snap=0.102074 sec, exc=0.102074 sec, total=0.102074 sec
wait times: max=0.100000 sec
wait counts: calls=1 os=1
occurred after 0.000013 sec of elapsed time
4: waited for 'CGS wait for IPC msg'
=0x0, =0x0, =0x0
wait_id=432476946 seq_num=11516 snap_id=1
wait times: snap=0.000023 sec, exc=0.000023 sec, total=0.000023 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000119 sec of elapsed time
5: waited for 'rdbms ipc message'
timeout=0xa, =0x0, =0x0
wait_id=432476945 seq_num=11515 snap_id=1
wait times: snap=0.103086 sec, exc=0.103086 sec, total=0.103086 sec
wait times: max=0.100000 sec
wait counts: calls=1 os=1
occurred after 0.000012 sec of elapsed time
6: waited for 'CGS wait for IPC msg'
=0x0, =0x0, =0x0
wait_id=432476944 seq_num=11514 snap_id=1
wait times: snap=0.000025 sec, exc=0.000025 sec, total=0.000025 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000152 sec of elapsed time
7: waited for 'rdbms ipc message'
timeout=0xa, =0x0, =0x0
wait_id=432476943 seq_num=11513 snap_id=1
wait times: snap=0.103090 sec, exc=0.103090 sec, total=0.103090 sec
wait times: max=0.100000 sec
wait counts: calls=1 os=1
occurred after 0.000016 sec of elapsed time
8: waited for 'CGS wait for IPC msg'
=0x0, =0x0, =0x0
wait_id=432476942 seq_num=11512 snap_id=1
wait times: snap=0.000025 sec, exc=0.000025 sec, total=0.000025 sec
wait times: max=0.000000 sec
wait counts: calls=1 os=1
occurred after 0.000272 sec of elapsed time
9: waited for 'rdbms ipc message'
timeout=0xa, =0x0, =0x0
wait_id=432476941 seq_num=11511 snap_id=1
wait times: snap=0.102084 sec, exc=0.102084 sec, total=0.102084 sec
wait times: max=0.100000 sec
wait counts: calls=1 os=1
occurred after 0.000013 sec of elapsed time
Sampled Session History of session 145 serial 1
---------------------------------------------------
The sampled session history is constructed by sampling
the target session every 1 second. The sampling process
captures at each sample if the session is in a non-idle wait,
an idle wait, or not in a wait. If the session is in a
non-idle wait then one interval is shown for all the samples
the session was in the same non-idle wait. If the
session is in an idle wait or not in a wait for
consecutive samples then one interval is shown for all
the consecutive samples. Though we display these consecutive
samples in a single interval the session may NOT be continuously
idle or not in a wait (the sampling process does not know).
The history is displayed in reverse chronological order.
sample interval: 1 sec, max history 120 sec
---------------------------------------------------
[121 samples, 21:08:20 - 21:10:20]
idle wait at each sample
temporary object counter: 0
----------------------------------------
Virtual Thread:
kgskvt: 3fc434ce8, sex: 3ffdb38d8 sid: 145 ser: 1
vc: 0, proc: 400b0c258, id: 145
consumer group cur: (upd? 0), mapped: _ORACLE_BACKGROUND_GROUP_, orig:
vt_state: 0x100, vt_flags: 0x4030, blkrun: 0, numa: 1
inwait: 0, short wait event: 0 posted_run: 0
location where insched last set: kgskthrrun
location where insched last cleared: kgskthrrun1
location where inwait last set: NULL
location where inwait last cleared: NULL
is_assigned: 0, in_sched: 0 (0)
qcls: 0, qlink: FALSE
vt_active: 0 (pending: 0)
vt_pq_active: 0, dop: 0
used quanta (usecs):
stmt: 0, accum: 0, mapped: 0, tot: 0
cpu start time: 0
idle time: 0, active time: 0 (cg: 0)
cpu yields:
stmt: 0, accum: 0, mapped: 0, tot: 0
cpu waits:
stmt: 0, accum: 0, mapped: 0, tot: 0
cpu wait time (usecs):
stmt: 0, accum: 0, mapped: 0, tot: 0
io waits:
stmt: 0, accum: 0, mapped: 0, tot: 0
io wait time:
stmt: 0, accum: 0, mapped: 0, tot: 0
ASL queued time outs: 0, time: 0 (cur 0, cg 0)
PQQ queued time outs: 0, time: 0 (cur 0, cg 0)
Queue timeout violation: 0
calls aborted: 0, num est exec limit hit: 0
undo current: 0k max: 0k
I/O credits acquired:small=0 large=0
I/O credits waiting for:small=0 large=0
KTU Session Commit Cache Dump for IDLs:
KTU Session Commit Cache Dump for Non-IDLs:
----------------------------------------
KKS-UOL used : 0 locks(used=0, free=0)
KGX Atomic Operation Log 3eae04a58
Mutex 0(0, 0) idn 0 oper NONE(0)
FSO mutex uid 145 efd 0 whr 0 slp 0
KGX Atomic Operation Log 3eae04aa8
Mutex 0(0, 0) idn 0 oper NONE(0)
FSO mutex uid 145 efd 0 whr 0 slp 0
KGX Atomic Operation Log 3eae04af8
Mutex 0(0, 0) idn 0 oper NONE(0)
FSO mutex uid 145 efd 0 whr 0 slp 0
KGX Atomic Operation Log 3eae04b48
Mutex 0(0, 0) idn 0 oper NONE(0)
FSO mutex uid 145 efd 0 whr 0 slp 0
----------------------------------------
KGL-UOL SO Cache (total = 0, free = 0)
KGX Atomic Operation Log 3eae047a0
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4 = 0 pt5 = 0 ub4 = 0
KGX Atomic Operation Log 3eae047f8
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4 = 0 pt5 = 0 ub4 = 0
KGX Atomic Operation Log 3eae04850
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4 = 0 pt5 = 0 ub4 = 0
KGX Atomic Operation Log 3eae048a8
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4 = 0 pt5 = 0 ub4 = 0
KGX Atomic Operation Log 3eae04900
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4 = 0 pt5 = 0 ub4 = 0
KGX Atomic Operation Log 3eae04958
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4 = 0 pt5 = 0 ub4 = 0
KGX Atomic Operation Log 3eae049b0
Mutex 0(0, 0) idn 0 oper NONE(0)
Library Cache uid 145 efd 0 whr 0 slp 0
oper=0 pt1=0 pt2=0 pt3=0
pt4 = 0 pt5 = 0 ub4 = 0
----------------------------------------
SO: 0x3f464e838, type: 57, owner: 0x3ffdb38d8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x400b0c258, name=dummy, file=ktccts.h LINE:415 ID:, pg=0
(dummy) nxc = 0, nlb = 0
===[ Callstack ]===
*** 2019-05-21 21:10:20.275
Process diagnostic dump for oracle@sm0ora10 (LMON), OS id=4604,
pid: 9, proc_ser: 1, sid: 145, sess_ser: 1
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
0.000000 secs at [ 21:10:19 ]
NOTE: scheduling delay has not been sampled for 0.505539 secs
0.000000 secs from [ 21:10:15 - 21:10:20 ], 5 sec avg
0.000000 secs from [ 21:09:20 - 21:10:20 ], 1 min avg
0.000000 secs from [ 21:05:21 - 21:10:20 ], 5 min avg
*** 2019-05-21 21:10:21.374
loadavg : 20.48 16.00 13.13
swap info: free_mem = 194124.91M rsv = 258509.04M
alloc = 239225.71M avail = 235374.84M swap_free = 254658.16M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 O grid 4604 4596 1 79 20 ? 168184 Sep 15 ? 781:33 asm_lmon_+ASM2
Short stack dump:
ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3592<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<-thr_sigsetmask()+512<-sslssalck()+152<-sskgxp_alarm_set()+44<-skgxp_twait()+376<-sslsshandler()+712<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-__pollsys()+8<-_pollsys()+232<-_poll()+140<-ssskgxp_poll()+36<-sskgxp_selectex()+224<-skgxpiwait()+6456<-skgxpwaiti()+5044<-skgxpwait()+984<-ksxpwait()+3360<-ksliwat()+10676<-kslwait()+240<-ksarcv()+212<-kjclwmg()+36<-kjfcln()+4284<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380
*** 2019-05-21 21:10:22.421
==============================
LMON (ospid: 4604) has not moved for 176 sec (1558444222.1558444046)
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+1320<-kjzdicrshnfy()+388<-ksuitm()+1084<-kjgcr_KillInstance()+148<-kjgcr_Main()+7564<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380
----- End of Abridged Call Stack Trace -----
6. Host uptime
root@ora10# hostname
ora10
root@ora10# uname -a
SunOS ora10 5.11 11.3 sun4v sparc sun4v
root@ora10# uptime
11:28pm up 248 day(s), 15:48, 8 users, load average: 6.21, 7.27, 8.09
7. The ORA-15046 and ORA-29770 errors, together with the stack in the TRC file, point to the following call chain:
ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3592<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<-thr_sigsetmask()+512<-sslssalck()+152<-sskgxp_alarm_set()+44<-skgxp_twait()+376<-sslsshandler()+712<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-__pollsys()+8<-_pollsys()+232<-_poll()+140<-ssskgxp_poll()+36<-sskgxp_selectex()+224<-skgxpiwait()+6456<-skgxpwaiti()+5044<-skgxpwait()+984<-ksxpwait()+3360<-ksliwat()+10676<-kslwait()+240<-ksarcv()+212<-kjclwmg()+36<-kjfcln()+4284<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380
Searching the Oracle MOS site, this matches exactly: (Doc ID 2159643.1)
Solaris: Process spins/ASM and DB Crash if RAC Instance Is Up For > 248 Days by LMHB with ORA-29770
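The 248-day figure in the MOS note is consistent with a signed 32-bit counter of 10 ms (100 Hz) clock ticks wrapping around. This is our interpretation of the symptom window, not an official statement of the internal mechanism:

```shell
# 2^31 ticks at 100 ticks/sec, converted to days.
# Integer part shown here; the exact value is ~248.55 days,
# matching the "> 248 Days" condition in Doc ID 2159643.1.
echo $(( 2147483648 / 100 / 86400 ))   # prints 248
```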
8. A follow-up check of the patches installed on the problem system showed that patch 25142535, the one mentioned in Doc ID 2159643.1, had been applied to the DB homes of all RAC nodes:
oracle@ora10$opatch lsinventory|grep 25142535
Patch 25142535 : applied on Tue Jul 31 16:26:30 GMT+08:00 2018
oracle@ora10$
oracle@ora11$opatch lsinventory|grep 25142535
Patch 25142535 : applied on Tue Jul 31 16:26:30 GMT+08:00 2018
oracle@ora11$
However, patch 25142535 had not been applied to the GI home on either node:
grid@ora10$ opatch lsinventory|grep 25142535
grid@ora10$
grid@ora11$ opatch lsinventory|grep 25142535
grid@ora11$
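The checks above can be wrapped in a small helper so they are easy to repeat per home and per node. The helper name is our own; `opatch lsinventory` prints one "Patch <id> : applied ..." line per installed one-off, so a grep is enough to confirm presence:

```shell
#!/bin/sh
PATCH=25142535

# Reads `opatch lsinventory` output on stdin, prints APPLIED or MISSING
check_patch() {
  if grep "Patch $1" >/dev/null 2>&1; then echo APPLIED; else echo MISSING; fi
}

# Run once per home on every node, e.g.:
#   su - oracle -c 'opatch lsinventory' | check_patch "$PATCH"   # DB home
#   su - grid   -c 'opatch lsinventory' | check_patch "$PATCH"   # GI home
# Worked example against a sample inventory line:
echo "Patch 25142535 : applied on Tue Jul 31 16:26:30 GMT+08:00 2018" | check_patch "$PATCH"   # prints APPLIED
```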
9. We consulted MOS about this issue; the official Oracle response did not clearly state whether patch 25142535 really needs to be applied to GI as well.
10. Patch 25142535 is a merge of patches 18740837 and 10194190; the readme of the 18740837 one-off patch states that in a RAC environment it must also be applied to GI.
11. Conclusion:
From the actual behavior observed, together with the MOS response and the patch readme above, it can be determined that on Oracle 11.2.0.4 RAC for Solaris, patch 25142535 must also be applied to the GI cluster software on all nodes. What is certain: the ASM and DB crash and restart were indeed caused by the RAC nodes running on Solaris for more than 248 days, and applying patch 25142535 only to the DB software of all cluster nodes does not solve it. What Oracle has not officially confirmed is whether also applying patch 25142535 to the GI software resolves:
Solaris: Process spins/ASM and DB Crash if RAC Instance Is Up For > 248 Days by LMHB with ORA-29770
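Until the effect of the GI patch is confirmed, one pragmatic interim mitigation (our suggestion, not from MOS) is to schedule a rolling reboot of each node before its host uptime reaches the ~248.5-day wrap. A sketch, with names and thresholds of our own choosing:

```shell
#!/bin/sh
WRAP_DAYS=248      # 2^31 ticks at 100 Hz ~= 248.55 days
MARGIN_DAYS=8      # reboot at least this many days early

# days_left BOOT_EPOCH NOW_EPOCH -> days remaining before the safety margin
days_left() {
  up_days=$(( ($2 - $1) / 86400 ))
  echo $(( WRAP_DAYS - MARGIN_DAYS - up_days ))
}

# On Solaris the boot time (epoch seconds) can be read with:
#   kstat -p unix:0:system_misc:boot_time | awk '{print $2}'
# Worked example with fixed numbers: a node booted 240 days ago
days_left 0 $(( 240 * 86400 ))   # prints 0 -> reboot window reached
```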
If you have run into a similar problem and can confirm whether GI patch 25142535 (or some other approach) resolves it, please leave a comment. Thanks!