How to Diagnose Oracle RAC Cluster Node Evictions (Reboots)

Applies to

Oracle Database - Enterprise Edition - version 11.2.0.1 to 12.1.0.2 [release 11.2 to 12.1]
Information in this document applies to all platforms

Purpose

This article provides a reference method for diagnosing cluster node evictions in 11.2 and later. For node evictions on releases before 11.2, please leave a comment.

Details

Node eviction overview

Oracle Clusterware evicts one or more nodes from the cluster when it detects a serious problem. Such problems include a node missing its network heartbeat, a node missing its disk heartbeat, a server that is unresponsive or has severe performance problems, and an unresponsive ocssd.bin process. The purpose of node eviction is to keep the cluster as a whole healthy by removing the problem nodes.
Starting with 11.2.0.2 RAC (or Exadata), a node eviction may not actually reboot the host. This is called a rebootless restart. In this case, most of the Clusterware processes are restarted to see whether that resolves the problem on the node.

1.0 - Processes that can cause a reboot

OCSSD (also known as the CSS daemon) - this process is started by the cssdagent process. It runs in environments both with and without third-party clusterware. OCSSD's main role is inter-node health monitoring and database instance discovery. Health monitoring covers the network heartbeat and the disk heartbeat (to the voting disks). OCSSD can also initiate a node eviction after it receives a member kill escalation request from a client (such as a database LMON process). It is a multi-threaded process that runs as the Oracle user at an elevated priority.

Boot sequence: INIT -> init.ohasd -> ohasd -> ohasd.bin -> cssdagent -> ocssd -> ocssd.bin

CSSDAGENT - this process is started by OHASD and is responsible for starting OCSSD. It monitors for node hangs (similar to oprocd), monitors OCSSD for hangs (similar to oclsomon), and monitors third-party clusterware (similar to vmon). It is a multi-threaded process that runs as root at an elevated priority.

Boot sequence: INIT -> init.ohasd -> ohasd -> ohasd.bin -> cssdagent

CSSDMONITOR - this process monitors for node hangs (similar to oprocd), monitors OCSSD for hangs (similar to oclsomon), and monitors third-party clusterware (similar to vmon). It is a multi-threaded process that runs as root at an elevated priority.
Boot sequence: INIT -> init.ohasd -> ohasd -> ohasd.bin -> cssdmonitor

2.0 - Determine which process initiated the reboot

Important files to review:

  • Clusterware alert log in <GRID_HOME>/log/
  • The cssdagent log(s) in <GRID_HOME>/log//agent/ohasd/oracssdagent_root
  • The cssdmonitor log(s) in <GRID_HOME>/log//agent/ohasd/oracssdmonitor_root
  • The ocssd log(s) in <GRID_HOME>/log//cssd
  • The lastgasp log(s) in /etc/oracle/lastgasp or /var/opt/oracle/lastgasp
  • IPD/OS or OS Watcher data
  • The GRID home 'opatch lsinventory -detail' output
  • OS messages file, by platform:
    • Linux: /var/log/messages
    • Sun: /var/adm/messages
    • HP-UX: /var/adm/syslog/syslog.log
    • IBM: /bin/errpt -a > messages.out
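
The per-platform messages sources above can be captured in a small lookup table. The following is a minimal sketch, not part of any Oracle tooling: the helper name messages_source is hypothetical, and the keys assume the names returned by Python's platform.system() ("SunOS" for Sun and "AIX" for IBM are assumptions about how those platforms identify themselves).

```python
import platform

# Messages sources taken from the list above. The dictionary keys are
# assumed platform.system() names and are not stated in the article.
MESSAGES_SOURCES = {
    "Linux": ("file", "/var/log/messages"),
    "SunOS": ("file", "/var/adm/messages"),
    "HP-UX": ("file", "/var/adm/syslog/syslog.log"),
    "AIX": ("command", "/bin/errpt -a > messages.out"),
}


def messages_source(os_name):
    """Return (kind, value) for the OS messages source, or None if unknown.

    kind is "file" when value is a path to collect, and "command" when
    value is a command whose output must be captured.
    """
    return MESSAGES_SOURCES.get(os_name)


if __name__ == "__main__":
    # Show what would need to be collected on the current host.
    print(messages_source(platform.system()))
```

On IBM the entry is a command rather than a file, so a collection script would have to run it and save the output instead of copying a path.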

In most cases, when a node is evicted in 11.2, meaningful diagnostic information is recorded in the Clusterware alert log. This information shows which process initiated the reboot. The following is a sample from a Clusterware alert log:

[ohasd(11243)]CRS-8011:reboot advisory message from host: sta00129, component: cssagent, with timestamp: L-2009-05-05-10:03:25.340
[ohasd(11243)]CRS-8013:reboot advisory message text: Rebooting after limit 28500 exceeded; disk timeout 27630, network timeout 28500, last heartbeat from CSSD at epoch seconds 1241543005.340, 4294967295 milliseconds ago based on invariant clock value of 93235653

This eviction was caused by a network timeout. CSSDAGENT initiated the reboot after the CSSD process exited. CSSDAGENT knows the information in this message from the local heartbeats it receives from CSSD.
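
When many alert logs have to be reviewed, the fields of these two messages can be pulled out with regular expressions. The sketch below is only an illustration built against the sample lines above; the function names are hypothetical and the patterns are assumptions that may not match the message wording in other releases. The likely_exceeded helper is a heuristic of my own (the timeout whose value equals the reported limit is taken as the one that was hit); it agrees with this sample but is not documented Oracle behavior.

```python
import re

# Sample lines copied from the alert log excerpt above.
SAMPLE = [
    "[ohasd(11243)]CRS-8011:reboot advisory message from host: sta00129, "
    "component: cssagent, with timestamp: L-2009-05-05-10:03:25.340",
    "[ohasd(11243)]CRS-8013:reboot advisory message text: Rebooting after "
    "limit 28500 exceeded; disk timeout 27630, network timeout 28500, "
    "last heartbeat from CSSD at epoch seconds 1241543005.340, "
    "4294967295 milliseconds ago based on invariant clock value of 93235653",
]

# Patterns derived from the sample only (assumption).
ORIGIN_RE = re.compile(
    r"CRS-8011:.*host: (?P<host>[^,]+), component: (?P<component>[^,]+),"
)
TIMEOUT_RE = re.compile(
    r"CRS-8013:.*limit (?P<limit>\d+) exceeded; "
    r"disk timeout (?P<disk>\d+), network timeout (?P<network>\d+)"
)


def parse_reboot_advisory(lines):
    """Collect host, component, and timeout values from advisory lines."""
    info = {}
    for line in lines:
        origin = ORIGIN_RE.search(line)
        if origin:
            info["host"] = origin.group("host")
            info["component"] = origin.group("component")
        timeouts = TIMEOUT_RE.search(line)
        if timeouts:
            info.update({k: int(v) for k, v in timeouts.groupdict().items()})
    return info


def likely_exceeded(info):
    """Heuristic (assumption): timeouts whose value equals the limit."""
    return [name for name in ("disk", "network")
            if info.get(name) == info.get("limit")]


if __name__ == "__main__":
    parsed = parse_reboot_advisory(SAMPLE)
    print(parsed)
    print(likely_exceeded(parsed))
```

For the sample, the component field identifies the process that initiated the reboot, and the heuristic points at the network timeout, which matches the interpretation given above.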

If the Clusterware alert log on the evicted node contains no relevant message, check the lastgasp logs on that node and/or the Clusterware alert logs on the other nodes.

3.0 - Diagnosing an OCSSD-initiated eviction

If you encounter an OCSSD-initiated eviction, refer to the common causes listed in section 3.1.

3.1 - Common causes of an OCSSD eviction

  • Network failure or latency between nodes. A node is evicted after 30 consecutive seconds of missed network heartbeats (the default, determined by the CSS misscount setting).
  • CSS cannot read from or write to the voting disks. If a node cannot complete its disk heartbeat to a majority of the voting disks, it is evicted.
  • Member kill escalation. For example, the LMON process of a database instance may ask CSS to remove an instance from the cluster. If removing the instance times out, the request is escalated to a node eviction.
  • The OCSSD process fails or hangs. This can be caused by any of the conditions above or by something else.
  • An Oracle bug.
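
To make the misscount rule in the first cause concrete, the sketch below scans a series of heartbeat arrival times and reports any gap that reaches the threshold. It is a simplified illustration, not how CSS is implemented: the function name and the input timestamps are hypothetical, and the 30-second default is the value stated above, which may differ on a given cluster.

```python
# Default stated in the text; the actual CSS misscount may be configured
# differently on a real cluster.
MISSCOUNT_SECONDS = 30


def missed_heartbeat_windows(timestamps, misscount=MISSCOUNT_SECONDS):
    """Return (last_seen, next_seen) pairs separated by >= misscount seconds.

    timestamps is a hypothetical list of heartbeat arrival times in seconds.
    """
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier >= misscount
    ]


if __name__ == "__main__":
    # Heartbeats arrive every second, then stop for 33 seconds.
    print(missed_heartbeat_windows([0, 1, 2, 35, 36]))
```

With the example input, the only reported window runs from second 2 to second 35, a 33-second silence that would exceed the default threshold.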

3.2 - Files to collect and review for an OCSSD eviction

Collect all of the files listed in section 2.0 from all nodes. More information may be needed.

Sample of an eviction caused by a voting disk problem:

CSS log:

2012-03-27 22:05:48.693: [ CSSD][1100548416](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 3 configured voting disks available, need 2
2012-03-27 22:05:48.693: [ CSSD][1100548416]###################################
2012-03-27 22:05:48.693: [ CSSD][1100548416]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread

OS messages:

Mar 27 22:03:58 choldbr132p kernel: Error:Mpx:All paths to Symm 000190104720 vol 0c71 are dead.
Mar 27 22:03:58 choldbr132p kernel: Error:Mpx:Symm 000190104720 vol 0c71 is dead.
Mar 27 22:03:58 choldbr132p kernel: Buffer I/O error on device sdbig, logical block 0
...
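
The clssnmvDiskCheck line in the CSS log above states how many voting disks were reachable and how many were required. A minimal sketch for extracting those numbers follows. The function name is hypothetical, the pattern is derived only from this sample line, and the majority formula (more than half of the configured disks) is my assumption; it agrees with the "need 2" of 3 shown in the sample.

```python
import re

# Pattern derived from the sample CSS log line only (assumption).
DISK_CHECK_RE = re.compile(
    r"clssnmvDiskCheck: Aborting, (\d+) of (\d+) configured voting disks "
    r"available, need (\d+)"
)


def voting_disk_status(line):
    """Parse a clssnmvDiskCheck abort line; return None if it does not match."""
    match = DISK_CHECK_RE.search(line)
    if match is None:
        return None
    available, configured, needed = (int(value) for value in match.groups())
    return {
        "available": available,
        "configured": configured,
        "needed": needed,
        # Assumed strict majority: more than half of the configured disks.
        "majority": configured // 2 + 1,
        "has_majority": available >= needed,
    }


if __name__ == "__main__":
    sample = (
        "2012-03-27 22:05:48.693: [ CSSD][1100548416](:CSSNM00018:)"
        "clssnmvDiskCheck: Aborting, 0 of 3 configured voting disks "
        "available, need 2"
    )
    print(voting_disk_status(sample))
```

In the sample, none of the three voting disks were available, which lines up with the multipath and I/O errors recorded in the OS messages about two minutes earlier.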

4.0 - Diagnosing a CSSDAGENT or CSSDMONITOR eviction

If you encounter a CSSDAGENT or CSSDMONITOR eviction, refer to the common causes listed in section 4.1.

4.1 - Common causes of a CSSDAGENT or CSSDMONITOR eviction

  • OS scheduling problems. For example, a driver or hardware problem, or excessive host load (CPU at 100%), can prevent the OS scheduler from behaving normally.
  • One or more CSSD threads hang.
  • An Oracle bug.

4.2 - Files to collect and review for a CSSDAGENT or CSSDMONITOR eviction

Collect all of the files listed in section 2.0 from all nodes. More information may be needed.
Reference:
Troubleshooting Clusterware Node Evictions (Reboots) (Doc ID 1050693.1)

Origin blog.csdn.net/baidu_39459954/article/details/80625509