Talking About Oracle RAC: The Basic Architecture of the Cluster Management Software GI

It's Friday, and I get to spend the weekend however I like, so I'm in a good mood. My favorite time of the week is Friday night; my least favorite is Sunday night and Monday. It seems I'm not exactly someone who loves labor. While the mood is good, let me sit down and keep writing this blog.

In yesterday's post I introduced what Oracle RAC is; if you haven't read it, feel free to go back and take a look. As I said there, from an implementation standpoint Oracle RAC is a database built on top of cluster management software, so the foundation for studying Oracle RAC is understanding how that cluster management software works.

Yesterday's post also mentioned that Oracle no longer relies on third-party cluster management software; it now ships its own standardized and powerful cluster management software, Grid Infrastructure (GI).

1. The basic structure of GI

Taking 19c as an example, let's look at what components and resources GI is made of.

[Figure: GI component and resource hierarchy in 19c]
Looking at the figure above, many readers may be taken aback: does GI really have that many components and resources? There's no need to worry. The figure has a few large, clearly layered structures, and once we grasp those hierarchical relationships, it becomes quite simple.

To make a long story short, let's first extract the core components of GI.

1.1 OHASD

OHASD is the cluster startup daemon. From the figure above, we can see that OHASD starts three processes whose names end in agent, which we call agent processes. These agent processes are used to start, stop, and monitor the components or resources they manage. It is as if OHASD were a chairman who appoints three managers, each looking after a different department.
Among the many components and resources managed by OHASD's agent processes, two are especially important: CSSD and CRSD.
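
As a side note, you can list the resources that OHASD manages directly on a running cluster. This is just a quick illustrative check, assuming GI is already up on the node and crsctl is on the PATH; the exact resource list will vary by environment.

[root@node1 ~]# crsctl stat res -t -init

The -init flag limits the output to the OHASD-managed (lower) stack, so you will see entries such as ora.cssd, ora.crsd and ora.asm rather than the database resources.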

1.2 CSSD

This component is responsible for forming the cluster and maintaining cluster consistency. A cluster is only a cluster because some mechanism links different computers together, and CSSD is the core component that maintains that consistency.
Each node runs a CSSD daemon. These processes communicate over the private network and periodically send network heartbeats to the other nodes to confirm inter-node connectivity. At the same time, the CSSD on each node periodically writes disk heartbeats to the shared disk to confirm that every member node can still read and write the shared storage through I/O.
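
The shared disk that receives these disk heartbeats is the voting disk. If you want to confirm which disks your CSSD is heartbeating against, the following query can be run on a live cluster; this is only an illustrative check, and the disk names will of course differ in your environment.

[root@node1 ~]# crsctl query css votedisk

The command lists each configured voting disk together with its state, and that is exactly the storage CSSD's disk heartbeat is written to.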

1.3 CRSD

This component is mainly responsible for managing resources in the cluster.
Looking at the figure again, CRSD also spawns agent processes, and these agents manage many resources whose names begin with ora. It should be emphasized that not all resources start with ora: generally speaking, resources beginning with ora are the ones Oracle ships, while customer-defined resources do not necessarily follow that naming.
Look more closely at the resources managed by CRSD: did you spot one named ora.DB.db? That's right, this is how the Oracle RAC database appears in the cluster. The Oracle RAC database instance is managed by the CRSD component as a resource, and the DB here is the name of the database you created.
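
If you want to check the database resource on your own cluster, you can query CRSD's resources directly. Two illustrative commands, assuming a database named orcl as in the startup log shown later in this post; the resource names on your system will differ.

[grid@node1 ~]$ crsctl stat res -t
[grid@node1 ~]$ crsctl stat res ora.orcl.db -t

The first command lists all CRSD-managed (upper stack) resources in a table; the second narrows the output to the database resource, showing on which nodes its instances are ONLINE.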

Now that we have pinned down the three core components above, let's sort out the agent processes.

They are:
1) the orarootagent, cssdagent, and oraagent started by OHASD;
2) the orarootagent and oraagent started by CRSD.

As you can see, agent processes are spawned only by OHASD and CRSD; no other component spawns agent processes.

Once you extract these three key components and then follow the components and resources hanging under each agent process, the picture turns out to be not that complicated after all.

Of course, each of the components and resources in the figure has its own specific function; we'll find time to expand on them later.

2. GI startup

Next, let's start GI and look at the startup sequence.
The command to start GI is crsctl start crs. I appended -wait so that the startup progress is printed to the screen; without -wait, nothing is printed. crsctl start crs must be executed as the root user.

[root@node1 ~]#crsctl start crs -wait
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.mdnsd' on 'node1'
CRS-2672: Attempting to start 'ora.evmd' on 'node1'
CRS-2676: Start of 'ora.mdnsd' on 'node1' succeeded
CRS-2676: Start of 'ora.evmd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'node1'
CRS-2676: Start of 'ora.gpnpd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'node1'
CRS-2676: Start of 'ora.gipcd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.crf' on 'node1'
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'node1'
CRS-2676: Start of 'ora.cssdmonitor' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'node1'
CRS-2672: Attempting to start 'ora.diskmon' on 'node1'
CRS-2676: Start of 'ora.diskmon' on 'node1' succeeded
CRS-2676: Start of 'ora.crf' on 'node1' succeeded
CRS-2676: Start of 'ora.cssd' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'node1'
CRS-2672: Attempting to start 'ora.ctssd' on 'node1'
CRS-2676: Start of 'ora.ctssd' on 'node1' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'node1'
CRS-2676: Start of 'ora.asm' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'node1'
CRS-2676: Start of 'ora.storage' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.crsd' on 'node1'
CRS-2676: Start of 'ora.crsd' on 'node1' succeeded
CRS-6023: Starting Oracle Cluster Ready Services-managed resources
CRS-6017: Processing resource auto-start for servers: node1
CRS-2672: Attempting to start 'ora.ons' on 'node1'
CRS-2672: Attempting to start 'ora.chad' on 'node1'
CRS-2676: Start of 'ora.chad' on 'node1' succeeded
CRS-2676: Start of 'ora.ons' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.DATA.dg' on 'node1'
CRS-2676: Start of 'ora.DATA.dg' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.orcl.db' on 'node1'
CRS-2676: Start of 'ora.orcl.db' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.orcl.orcltest.svc' on 'node1'
CRS-2676: Start of 'ora.orcl.orcltest.svc' on 'node1' succeeded
CRS-6016: Resource auto-start has completed for server node1
CRS-6024: Completed start of Oracle Cluster Ready Services-managed resources
CRS-4123: Oracle High Availability Services has been started.

Looking at the output above, the first thing we see is CRS-4123.

CRS-4123: Starting Oracle High Availability Services-managed resources

This message tells us that the resources managed by OHASD are about to be started, followed by the startup information for those components and resources. Here we can see that CSSD and CRSD, the two components we care about most, have both been brought up.

CRS-2676: Start of 'ora.cssd' on 'node1' succeeded
CRS-2676: Start of 'ora.crsd' on 'node1' succeeded

Further down, we see CRS-6023.

CRS-6023: Starting Oracle Cluster Ready Services-managed resources

CRS-6023 indicates that the next step is to start the resources managed by CRSD, and we can then see the database resource ora.orcl.db being started.

When startup completes successfully, the following message is printed.

CRS-4123: Oracle High Availability Services has been started

From this output we can see that GI startup is divided into two major phases: one starts the resources managed by OHASD, and the other starts the resources managed by CRSD.

If you match this startup output against the structure diagram above, the basic framework of GI becomes much clearer.
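
Once startup finishes, a quick way to confirm that both layers are healthy is to check the daemons themselves. This is a routine check rather than anything specific to this environment, and on a healthy node the output typically looks like the following.

[root@node1 ~]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

The first line corresponds to the OHASD layer, and the remaining lines to the services running on top of it, which mirrors the two-phase startup we just walked through.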

3. Confirming GI-related processes

Whether we call them components or resources, everything discussed above ultimately shows up as processes on the OS.
With the ps command we can see the main GI processes.

[root@node1 ~]# ps -ef | grep /u01/64bit/app/19.3.0/grid/bin/
root     24048     1  3 11:53 ?        00:00:48 /u01/64bit/app/19.3.0/grid/bin/ohasd.bin reboot _ORA_BLOCKING_STACK_LOCALE=AMERICAN_AMERICA.US7ASCII
root     24452     1  0 11:54 ?        00:00:12 /u01/64bit/app/19.3.0/grid/bin/orarootagent.bin
grid     24629     1  1 11:54 ?        00:00:13 /u01/64bit/app/19.3.0/grid/bin/oraagent.bin
grid     24674     1  0 11:54 ?        00:00:06 /u01/64bit/app/19.3.0/grid/bin/mdnsd.bin
grid     24676     1  1 11:54 ?        00:00:17 /u01/64bit/app/19.3.0/grid/bin/evmd.bin
grid     24755     1  0 11:54 ?        00:00:07 /u01/64bit/app/19.3.0/grid/bin/gpnpd.bin
grid     24832 24676  0 11:54 ?        00:00:06 /u01/64bit/app/19.3.0/grid/bin/evmlogger.bin -o /u01/64bit/app/19.3.0/grid/log/[HOSTNAME]/evmd/evmlogger.info -l /u01/64bit/app/19.3.0/grid/log/[HOSTNAME]/evmd/evmlogger.log
grid     24892     1  1 11:54 ?        00:00:16 /u01/64bit/app/19.3.0/grid/bin/gipcd.bin
root     25139     1  0 11:54 ?        00:00:07 /u01/64bit/app/19.3.0/grid/bin/cssdmonitor
root     25142     1  4 11:54 ?        00:01:00 /u01/64bit/app/19.3.0/grid/bin/osysmond.bin
root     25191     1  0 11:54 ?        00:00:07 /u01/64bit/app/19.3.0/grid/bin/cssdagent
grid     25249     1  3 11:54 ?        00:00:44 /u01/64bit/app/19.3.0/grid/bin/ocssd.bin
root     25839     1  1 11:54 ?        00:00:17 /u01/64bit/app/19.3.0/grid/bin/octssd.bin reboot
root     26294     1  2 11:54 ?        00:00:27 /u01/64bit/app/19.3.0/grid/bin/crsd.bin reboot
root     26857     1  1 11:55 ?        00:00:22 /u01/64bit/app/19.3.0/grid/bin/orarootagent.bin
grid     27514     1  2 11:55 ?        00:00:26 /u01/64bit/app/19.3.0/grid/bin/oraagent.bin
grid     27651     1  0 11:55 ?        00:00:00 /u01/64bit/app/19.3.0/grid/bin/tnslsnr LISTENER -no_crs_notify -inherit
grid     27757     1  0 11:55 ?        00:00:00 /u01/64bit/app/19.3.0/grid/bin/tnslsnr ASMNET1LSNR_ASM -no_crs_notify -inherit
oracle   28900     1  0 11:55 ?        00:00:08 /u01/64bit/app/19.3.0/grid/bin/oraagent.bin
root     29849 25142  0 11:56 ?        00:00:03 /u01/64bit/app/19.3.0/grid/perl/bin/perl /u01/64bit/app/19.3.0/grid/bin/diagsnap.pl start

From the output above we can see that the oraagent process with pid=28900 runs as the oracle user. That is because, when I installed this Oracle RAC environment, the GI software was installed by the grid user and the DB software by the oracle user, so the agent process that CRSD spawns to manage the database runs as oracle.

grid     24629     1  1 11:54 ?        00:00:13 /u01/64bit/app/19.3.0/grid/bin/oraagent.bin
grid     27514     1  2 11:55 ?        00:00:26 /u01/64bit/app/19.3.0/grid/bin/oraagent.bin
oracle   28900     1  0 11:55 ?        00:00:08 /u01/64bit/app/19.3.0/grid/bin/oraagent.bin
root     24452     1  0 11:54 ?        00:02:06 /u01/64bit/app/19.3.0/grid/bin/orarootagent.bin
root     26857     1  1 11:55 ?        00:04:38 /u01/64bit/app/19.3.0/grid/bin/orarootagent.bin

The oraagent processes with pid=24629 and 27514 run as the grid user, and the orarootagent processes with pid=24452 and 26857 run as root. Why are there two agent processes with the same name? Going back to the figure above, we can see that both OHASD and CRSD spawn agent processes named oraagent and orarootagent.

Under normal circumstances we don't need to distinguish these agent processes by pid, because when investigating a problem we usually just look at their trace logs, and the trace file names carry a prefix that makes it easy to tell which agent's log you are looking at.

/u01/64bit/app/grid/diag/crs/node1/crs/trace/crsd_oraagent_grid.trc
/u01/64bit/app/grid/diag/crs/node1/crs/trace/ohasd_oraagent_grid.trc
/u01/64bit/app/grid/diag/crs/node1/crs/trace/crsd_orarootagent_root.trc
/u01/64bit/app/grid/diag/crs/node1/crs/trace/ohasd_orarootagent_root.trc

But when you need to kill an agent process by hand, you do have to identify the right target. At that point, how do we confirm which one is OHASD's agent and which one is CRSD's agent?

We can tell from the cluster's alert log. Below is an excerpt of the alert log output from when the cluster was started.

2021-03-19 11:54:04.297 [OHASD(24048)]CRS-1301: Oracle High Availability Service started on node node1.
...
2021-03-19 11:54:09.525 [ORAROOTAGENT(24452)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 24452
2021-03-19 11:54:12.075 [ORAAGENT(24629)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 24629
...
2021-03-19 11:55:01.612 [CRSD(26294)]CRS-1201: CRSD started on node node1.
...
2021-03-19 11:55:04.324 [ORAROOTAGENT(26857)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 26857
2021-03-19 11:55:19.113 [ORAAGENT(27514)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 27514

Right after CRS-1301 (OHASD started) we see the orarootagent with pid=24452 and the oraagent with pid=24629, so we can conclude that these are the agent processes started by OHASD.
Right after CRS-1201 (CRSD started) we see the orarootagent with pid=26857 and the oraagent with pid=27514, so we can conclude that these are the agent processes started by CRSD.
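
If you don't want to scroll through the whole alert log, a simple grep gives you the same information. This is a hypothetical example that assumes the 12.2+ default log location under this environment's Grid base (/u01/64bit/app/grid); adjust the path to match your installation.

[root@node1 ~]# grep -E 'CRS-1301|CRS-1201|CRS-8500' /u01/64bit/app/grid/diag/crs/node1/crs/trace/alert.log

The CRS-8500 lines that appear between CRS-1301 (OHASD started) and CRS-1201 (CRSD started) belong to OHASD's agents; those after CRS-1201 belong to CRSD's agents.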

Whether it was the GI startup sequence or the GI-related processes above, the goal has been to help you understand the basic GI structure laid out at the beginning.

As long as we have a reasonably clear understanding of the GI structure diagram above, studying GI issues later will be much easier. You could say that this picture is the foundation for troubleshooting GI.

4. An introduction to GI's logs

I brought up the agent trace logs and the GI alert log rather abruptly above, so as a supplement let me briefly introduce GI's log system. When it comes to GI issues, any analysis that is not grounded in the logs is really just guesswork.

GI's log system changed significantly in 12.2, and the releases before 18c have already left Premier Support, so I will only cover the log system from 12.2 onward.

Most of GI's logs are stored in the following path:

<Grid_Base>/diag/crs/<hostname>/crs/trace/

The alert log is GI's consolidated log: any important event in a GI component is written there, much like the database's alert log. In addition, each GI component and agent process has its own trace log.
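
To get a feel for how many logs there are, you can simply list that trace directory. An illustrative command using this environment's Grid base; your path and hostname will differ.

[grid@node1 ~]$ ls /u01/64bit/app/grid/diag/crs/node1/crs/trace/

Besides the alert log (alert.log), you will see per-component trace files such as ocssd.trc, crsd.trc and ohasd.trc, as well as the agent trace files listed earlier.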

When investigating problems, we often need to read the logs of the resources or components carefully alongside the logs of the related agent processes. These components, agents, and resources do not operate in isolation, so we have to trace the interactions between them from their logs.

In addition, these logs have their own rotation mechanism. If I find time later, I'll write a dedicated post on GI's log system, so I won't expand on it further here.


The GI architecture is indeed rather complicated, and readers who are just getting started can easily feel lost. I've tried to summarize it from my own perspective to help you understand it. If you have any questions, feel free to leave me a comment.

It's getting late, so I'll stop here for now. Good night!

Origin blog.csdn.net/weixin_50510978/article/details/115013517