Red alert: Oracle RAC 11.2.0.4 for Solaris 10 ASM and DB restart due to cluster heartbeat loss

    Oracle RAC 11.2.0.4 for Solaris 10: ASM and DB instances restart due to cluster heartbeat loss. The problem is caused by bugs 10194190 and 18740837, and patch 25142535 is required to fix it.

A customer running Oracle 11.2.0.4 RAC on Solaris had applied the patch to the DB homes of all nodes, yet after the host had been up for 248 days the instances still restarted due to loss of the cluster heartbeat. According to the official reply from Oracle MOS, one of the bug fixes, patch 18740837, must also be applied to the GI software at the same time.

    Below is a summary of the problem analysis and resolution of this customer case.

    1. Error messages in the DB alert log of the problem node

    2. Error messages in the ASM alert log of the problem node

    3. Contents of the +ASM2_lmhb_4609_i78497.trc file

    4. OCSSD log reports heartbeat timeout

    5. Partial contents of the +ASM2_lmhb_4609_i78497.trc file

===[ Session State Object ]===

  ----------------------------------------

  SO: 0x3ffdb38d8, type: 4, owner: 0x400b0c258, flag: INIT/-/-/0x00 if: 0x3 c: 0x3

   proc=0x400b0c258, name=session, file=ksu.h LINE:12729 ID:, pg=0

  (session) sid: 145 ser: 1 trans: 0x0, creator: 0x400b0c258

            flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-

            flags2: (0x409) -/-/INC

            DID: , short-term DID:

            txn branch: 0x0

            edition#: 0            oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYS

  ksuxds FALSE at location: 0

  service name: SYS$BACKGROUND

  Current Wait Stack:

   0: waiting for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476951 seq_num=11521 snap_id=1

      wait times: snap=2 min 46 sec, exc=2 min 46 sec, total=2 min 46 sec

      wait times: max=0.100000 sec, heur=2 min 46 sec

      wait counts: calls=1 os=1

      in_wait=1 iflags=0x5a8

  Wait State:

    fixed_waits=0 flags=0x22 boundary=0x0/-1

  Session Wait History:

      elapsed time of 0.000015 sec since current wait

   0: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476950 seq_num=11520 snap_id=1

      wait times: snap=0.000027 sec, exc=0.000027 sec, total=0.000027 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000138 sec of elapsed time

   1: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476949 seq_num=11519 snap_id=1

      wait times: snap=0.102094 sec, exc=0.102094 sec, total=0.102094 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000015 sec of elapsed time

   2: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476948 seq_num=11518 snap_id=1

      wait times: snap=0.000022 sec, exc=0.000022 sec, total=0.000022 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000133 sec of elapsed time

   3: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476947 seq_num=11517 snap_id=1

      wait times: snap=0.102074 sec, exc=0.102074 sec, total=0.102074 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000013 sec of elapsed time

   4: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476946 seq_num=11516 snap_id=1

      wait times: snap=0.000023 sec, exc=0.000023 sec, total=0.000023 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000119 sec of elapsed time

   5: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476945 seq_num=11515 snap_id=1

      wait times: snap=0.103086 sec, exc=0.103086 sec, total=0.103086 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000012 sec of elapsed time

   6: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476944 seq_num=11514 snap_id=1

      wait times: snap=0.000025 sec, exc=0.000025 sec, total=0.000025 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000152 sec of elapsed time

   7: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476943 seq_num=11513 snap_id=1

      wait times: snap=0.103090 sec, exc=0.103090 sec, total=0.103090 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000016 sec of elapsed time

   8: waited for 'CGS wait for IPC msg'

      =0x0, =0x0, =0x0

      wait_id=432476942 seq_num=11512 snap_id=1

      wait times: snap=0.000025 sec, exc=0.000025 sec, total=0.000025 sec

      wait times: max=0.000000 sec

      wait counts: calls=1 os=1

      occurred after 0.000272 sec of elapsed time

   9: waited for 'rdbms ipc message'

      timeout=0xa, =0x0, =0x0

      wait_id=432476941 seq_num=11511 snap_id=1

      wait times: snap=0.102084 sec, exc=0.102084 sec, total=0.102084 sec

      wait times: max=0.100000 sec

      wait counts: calls=1 os=1

      occurred after 0.000013 sec of elapsed time

  Sampled Session History of session 145 serial 1

  ---------------------------------------------------

  The sampled session history is constructed by sampling

  the target session every 1 second. The sampling process

  captures at each sample if the session is in a non-idle wait,

  an idle wait, or not in a wait. If the session is in a

  non-idle wait then one interval is shown for all the samples

  the session was in the same non-idle wait. If the

  session is in an idle wait or not in a wait for

  consecutive samples then one interval is shown for all

  the consecutive samples. Though we display these consecutive

  samples  in a single interval the session may NOT be continuously

  idle or not in a wait (the sampling process does not know).

  The history is displayed in reverse chronological order.

  sample interval: 1 sec, max history 120 sec

  ---------------------------------------------------

    [121 samples,                                          21:08:20 - 21:10:20]

      idle wait at each sample

  temporary object counter: 0

    ----------------------------------------

    Virtual Thread:

    kgskvt: 3fc434ce8, sex: 3ffdb38d8 sid: 145 ser: 1

    vc: 0, proc: 400b0c258, id: 145

    consumer group cur:  (upd? 0), mapped: _ORACLE_BACKGROUND_GROUP_, orig:

    vt_state: 0x100, vt_flags: 0x4030, blkrun: 0, numa: 1

    inwait: 0, short wait event: 0 posted_run: 0

    location where insched last set: kgskthrrun

    location where insched last cleared: kgskthrrun1

    location where inwait last set: NULL

    location where inwait last cleared: NULL

    is_assigned: 0, in_sched: 0 (0)

    qcls: 0, qlink: FALSE

    vt_active: 0 (pending: 0)

    vt_pq_active: 0, dop: 0

    used quanta (usecs):

    stmt: 0, accum: 0, mapped: 0, tot: 0

    cpu start time: 0

    idle time: 0, active time: 0 (cg: 0)

    cpu yields:

    stmt: 0, accum: 0, mapped: 0, tot: 0

    cpu waits:

    stmt: 0, accum: 0, mapped: 0, tot: 0

    cpu wait time (usecs):

    stmt: 0, accum: 0, mapped: 0, tot: 0

    io waits:

    stmt: 0, accum: 0, mapped: 0, tot: 0

    io wait time:

    stmt: 0, accum: 0, mapped: 0, tot: 0

    ASL queued time outs: 0, time: 0 (cur 0, cg 0)

    PQQ queued time outs: 0, time: 0 (cur 0, cg 0)

    Queue timeout violation: 0

    calls aborted: 0, num est exec limit hit: 0

    undo current: 0k max: 0k

    I/O credits acquired:small=0 large=0

    I/O credits waiting for:small=0 large=0

    KTU Session Commit Cache Dump for IDLs:

    KTU Session Commit Cache Dump for Non-IDLs:

    ----------------------------------------

    KKS-UOL used : 0 locks(used=0, free=0)

    KGX Atomic Operation Log 3eae04a58

     Mutex 0(0, 0) idn 0 oper NONE(0)

     FSO mutex uid 145 efd 0 whr 0 slp 0

    KGX Atomic Operation Log 3eae04aa8

     Mutex 0(0, 0) idn 0 oper NONE(0)

     FSO mutex uid 145 efd 0 whr 0 slp 0

    KGX Atomic Operation Log 3eae04af8

     Mutex 0(0, 0) idn 0 oper NONE(0)

     FSO mutex uid 145 efd 0 whr 0 slp 0

    KGX Atomic Operation Log 3eae04b48

     Mutex 0(0, 0) idn 0 oper NONE(0)

     FSO mutex uid 145 efd 0 whr 0 slp 0

    ----------------------------------------

    KGL-UOL SO Cache (total = 0, free = 0)

    KGX Atomic Operation Log 3eae047a0

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4 = 0 pt5 = 0 ub4 = 0

    KGX Atomic Operation Log 3eae047f8

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4 = 0 pt5 = 0 ub4 = 0

    KGX Atomic Operation Log 3eae04850

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4 = 0 pt5 = 0 ub4 = 0

    KGX Atomic Operation Log 3eae048a8

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4 = 0 pt5 = 0 ub4 = 0

    KGX Atomic Operation Log 3eae04900

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4 = 0 pt5 = 0 ub4 = 0

    KGX Atomic Operation Log 3eae04958

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4 = 0 pt5 = 0 ub4 = 0

    KGX Atomic Operation Log 3eae049b0

     Mutex 0(0, 0) idn 0 oper NONE(0)

     Library Cache uid 145 efd 0 whr 0 slp 0

     oper=0 pt1=0 pt2=0 pt3=0

     pt4 = 0 pt5 = 0 ub4 = 0

    ----------------------------------------

    SO: 0x3f464e838, type: 57, owner: 0x3ffdb38d8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3

     proc=0x400b0c258, name=dummy, file=ktccts.h LINE:415 ID:, pg=0

    (dummy) nxc = 0, nlb = 0  

===[ Callstack ]===

*** 2019-05-21 21:10:20.275

Process diagnostic dump for oracle@sm0ora10 (LMON), OS id=4604,

pid: 9, proc_ser: 1, sid: 145, sess_ser: 1

-------------------------------------------------------------------------------

os thread scheduling delay history: (sampling every 1.000000 secs)

  0.000000 secs at [ 21:10:19 ]

    NOTE: scheduling delay has not been sampled for 0.505539 secs
  0.000000 secs from [ 21:10:15 - 21:10:20 ], 5 sec avg

  0.000000 secs from [ 21:09:20 - 21:10:20 ], 1 min avg

  0.000000 secs from [ 21:05:21 - 21:10:20 ], 5 min avg

*** 2019-05-21 21:10:21.374

loadavg : 20.48 16.00 13.13

swap info: free_mem = 194124.91M rsv = 258509.04M

           alloc = 239225.71M avail = 235374.84M swap_free = 254658.16M

 F S      UID   PID  PPID   C PRI NI     ADDR     SZ    WCHAN    STIME TTY         TIME CMD

 0 O     grid  4604  4596   1  79 20        ? 168184            Sep 15 ?         781:33 asm_lmon_+ASM2

Short stack dump:

ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3592<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<-thr_sigsetmask()+512<-sslssalck()+152<-sskgxp_alarm_set()+44<-skgxp_twait()+376<-sslsshandler()+712<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-__pollsys()+8<-_pollsys()+232<-_poll()+140<-ssskgxp_poll()+36<-sskgxp_selectex()+224<-skgxpiwait()+6456<-skgxpwaiti()+5044<-skgxpwait()+984<-ksxpwait()+3360<-ksliwat()+10676<-kslwait()+240<-ksarcv()+212<-kjclwmg()+36<-kjfcln()+4284<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380

*** 2019-05-21 21:10:22.421

==============================

LMON (ospid: 4604) has not moved for 176 sec (1558444222.1558444046)

kjzduptcctx: Notifying DIAG for crash event

----- Abridged Call Stack Trace -----

ksedsts()+1320<-kjzdicrshnfy()+388<-ksuitm()+1084<-kjgcr_KillInstance()+148<-kjgcr_Main()+7564<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380

----- End of Abridged Call Stack Trace -----

    6. Host uptime

root@ora10# hostname

ora10

root@ora10# uname -a

SunOS ora10 5.11 11.3 sun4v sparc sun4v

root@ora10# uptime

 11:28pm  up 248 day(s), 15:48,  8 users,  load average: 6.21, 7.27, 8.09

    7. Searching MOS with ORA-15046, ORA-29770, and the call stack information from the trace file:

ksedsts()+380<-ksdxfstk()+52<-ksdxcb()+3592<-sspuser()+140<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-_syscall6()+32<-thr_sigsetmask()+512<-sslssalck()+152<-sskgxp_alarm_set()+44<-skgxp_twait()+376<-sslsshandler()+712<-__sighndlr()+12<-call_user_handler()+868<-sigacthandler()+92<-__pollsys()+8<-_pollsys()+232<-_poll()+140<-ssskgxp_poll()+36<-sskgxp_selectex()+224<-skgxpiwait()+6456<-skgxpwaiti()+5044<-skgxpwait()+984<-ksxpwait()+3360<-ksliwat()+10676<-kslwait()+240<-ksarcv()+212<-kjclwmg()+36<-kjfcln()+4284<-ksbrdp()+1720<-opirip()+1680<-opidrv()+748<-sou2o()+88<-opimai_real()+512<-ssthrdmain()+324<-main()+316<-_start()+380

Searching the ORACLE MOS site, this is an exact match for (Doc ID 2159643.1):

Solaris: Process spins/ASM and DB Crash if RAC Instance Is Up For > 248 Days by LMHB with ORA-29770 
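The "248 days" threshold is commonly explained as a signed 32-bit counter of 10 ms (centisecond) clock ticks wrapping around after 2^31 ticks. The arithmetic below is an illustration of that widely cited explanation, not a statement from the patch readme; the 100 Hz tick rate is an assumption about the Solaris clock.

```shell
# Days until a signed 32-bit counter of 10 ms (100 Hz) ticks overflows:
# 2^31 ticks / (100 ticks per second * 86400 seconds per day)
awk 'BEGIN { printf "%.2f days\n", 2^31 / (100 * 86400) }'
# prints: 248.55 days
```

The result, about 248.55 days, matches the "> 248 Days" condition in the MOS note title.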

    8. A follow-up check of the patches installed on the problem system found that the DB homes of all RAC nodes already had patch 25142535, mentioned in Doc ID 2159643.1, applied

oracle@ora10$opatch lsinventory|grep 25142535

Patch  25142535     : applied on Tue Jul 31 16:26:30 GMT+08:00 2018

oracle@ora10$

oracle@ora11$opatch lsinventory|grep 25142535

Patch  25142535     : applied on Tue Jul 31 16:26:30 GMT+08:00 2018

oracle@ora11$

    However, patch 25142535 had not been applied to the GI home:

grid@ora10$ opatch lsinventory|grep  25142535

grid@ora10$ 

grid@ora11$ opatch lsinventory|grep  25142535

grid@ora11$ 
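The per-home checks above can be wrapped in a small helper. The sketch below greps captured `opatch lsinventory` output for a patch number; the sample inventory lines are illustrative, modeled on the real output in this section (in practice you would pipe `opatch lsinventory` from each DB and GI home into `has_patch`).

```shell
# has_patch: succeed if the lsinventory output on stdin lists the given patch
has_patch() {
    grep -q "^Patch  *$1 "
}

# Illustrative captured output: the DB home lists the patch, the GI home does not
db_inventory="Patch  25142535     : applied on Tue Jul 31 16:26:30 GMT+08:00 2018"
gi_inventory=""

printf '%s\n' "$db_inventory" | has_patch 25142535 && echo "DB home: 25142535 applied"
printf '%s\n' "$gi_inventory" | has_patch 25142535 || echo "GI home: 25142535 MISSING"
```

Running the sketch prints "DB home: 25142535 applied" and "GI home: 25142535 MISSING", which is exactly the asymmetry observed on this cluster.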

    9. When consulted about this issue, the official response from Oracle MOS did not clearly state whether patch 25142535 really needs to be applied on GI

    10. Patch 25142535 is a merge of patches 18740837 and 10194190, and the readme of patch 18740837 indicates that in a RAC environment it must also be applied to the GI home

    11. Conclusion of the analysis:

From the actual behavior of this case, together with the MOS response and the patch readme above, it can be determined that patch 25142535 needs to be applied to the GI cluster software on all nodes of Oracle 11.2.0.4 RAC for Solaris. What is certain is that the current ASM and DB crash and restart was indeed caused by the RAC system running on Solaris for 248 days, and that applying patch 25142535 only to the DB software of all cluster nodes does not solve it. What Oracle has not clearly confirmed is whether additionally applying patch 25142535 to the GI software resolves:

Solaris: Process spins/ASM and DB Crash if RAC Instance Is Up For > 248 Days by LMHB with ORA-29770


    If readers who have encountered a similar problem find this post, and have confirmed that applying patch 25142535 to GI solves it, or have solved it by other means, please leave a comment. Thanks!


Origin blog.csdn.net/www_xue_xi/article/details/90483212