Multiple listener failure case analysis

Shi Yunhua  Oracle all-in-one user group  published in Beijing on 2023-05-15 14:42 

1. Case Overview

     

One night, friend A called me to chat and shared a recent case he had.

They run an 11g RAC database in which the SCAN listener and the local listeners all use port 1521. To set up a DataGuard environment, they created a second listener, LISTENER_DG (same listening addresses, port 1522), dedicated to DataGuard. The aim was to isolate business traffic from DataGuard traffic at the network level: the business system uses port 1521 and DataGuard uses port 1522. After the second listener was created, business staff reported that connections to the database through the SCAN address sometimes succeeded and sometimes failed.

After investigating, he found that the address of the new second listener had been added automatically to the local_listener parameter through dynamic registration. The SCAN listener was redirecting some of the new connection requests from the business system to port 1522 on the local listener LISTENER_DG. Because of security policy restrictions, only port 1521 was open between the host running the business system and the database hosts, so those connections failed.

In the end, friend A's solution to the problem was to clear the 1522 port entry in local_listener, modify the listener.ora file, and change the second listener from dynamic listening to static listening.

Regarding the case shared by friend A, in fact, I only remembered the phenomenon and solution of the fault that night, and did not delve into the technical details. About a month later, friend B called me and said he encountered a fault and asked me to help analyze it. After friend B described the fault phenomenon to me, I immediately realized that this fault was the fault that friend A had just encountered, so I relayed friend A's case to friend B.

In order to clarify some technical details of this fault, I reproduced the fault in the test environment and gave a second solution to the fault.

2. Fault reproduction

In order to further understand the technical details of the entire failure, I set up a test environment, as shown below.

1. Test environment and listener-related parameters

[grid@11grac1 admin]$ lsnrctl status listener_scan1

...... (omitted)

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.124)(PORT=1521)))

Services Summary...

Service "cdb" has 2 instance(s).

  Instance "cdb1", status READY, has 1 handler(s) for this service...

  Instance "cdb2", status READY, has 1 handler(s) for this service...

The command completed successfully

[grid@11grac1 admin]$

[grid@11grac1 admin]$ lsnrctl status listener

...... (omitted)

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.120)(PORT=1521)))

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.121)(PORT=1521)))

Services Summary...

Service "cdb" has 1 instance(s).

  Instance "cdb1", status READY, has 2 handler(s) for this service...

The command completed successfully

[grid@11grac1 admin]$

SQL> show parameter listener

NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

listener_networks                  string

local_listener                        string       (ADDRESS=(PROTOCOL=TCP)(HOST=

                                                           192.168.56.121)(PORT=1521))

remote_listener                     string      rac11g-scan:1521

SQL>

Listener-related parameters on instance 2:

[grid@11grac2 admin]$ lsnrctl status listener

...... (omitted)

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.122)(PORT=1521)))

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.123)(PORT=1521)))

Services Summary...

Service "cdb" has 1 instance(s).

  Instance "cdb2", status READY, has 2 handler(s) for this service...

The command completed successfully

[grid@11grac2 admin]$

SQL> show parameter listener

NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

listener_networks                  string

local_listener                         string       (ADDRESS=(PROTOCOL=TCP)(HOST=

                                                           192.168.56.123)(PORT=1521))

remote_listener                      string      rac11g-scan:1521

SQL>

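Before the second listener is created, the SCAN listener uses port 1521 and the local listener LISTENER also uses port 1521. At this point the business system connects to the database through the SCAN address without problems.

2. A network packet capture shows that when the client connects through the SCAN address (192.168.56.124), it first communicates with port 1521 on the SCAN address. The SCAN listener then redirects the connection request to the local listener on node 2 (192.168.56.123), which listens on port 1521.

3. Next, we reproduce the fault by creating a second local listener named LISTENER_DG that listens on port 1522.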

[grid@11grac1 ~]$ srvctl add listener -l LISTENER_DG -p 1522 -k 1

[grid@11grac1 ~]$ srvctl start listener -l LISTENER_DG -n 11grac1

[grid@11grac1 ~]$ srvctl start listener -l LISTENER_DG -n 11grac2

[grid@11grac1 ~]$

[grid@11grac1 ~]$ crsctl status resource -t

--------------------------------------------------------------------------------

NAME           TARGET  STATE        SERVER                   STATE_DETAILS      

--------------------------------------------------------------------------------

Local Resources

--------------------------------------------------------------------------------

...... (omitted)

ora.LISTENER.lsnr

               ONLINE  ONLINE       11grac1                                     

               ONLINE  ONLINE       11grac2                                     

ora.LISTENER_DG.lsnr

               ONLINE  ONLINE       11grac1                                     

               ONLINE  ONLINE       11grac2

...... (omitted)

[grid@11grac1 ~]$ lsnrctl status listener_dg

...... (omitted)

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.120)(PORT=1522)))

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.121)(PORT=1522)))

The listener supports no services

The command completed successfully

[grid@11grac1 ~]$

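As shown above, the second local listener (LISTENER_DG) has been created and started. It listens on port 1522, which is not yet listed in local_listener, so no service has registered with it.

4. After the CRS cluster is restarted, or local_listener is modified manually, the cluster automatically registers the local listener (LISTENER_DG) dynamically. The configuration below was captured after a CRS restart.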

SQL> show parameter listener

NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

listener_networks                  string

local_listener                         string       (ADDRESS=(PROTOCOL=TCP)(HOST=

                                                           192.168.56.123)(PORT=1521)), (

                                                          ADDRESS=(PROTOCOL=TCP)(HOST=19

                                                            2.168.56.123)(PORT=1522))

remote_listener                      string      rac11g-scan:1521

SQL>

[grid@11grac2 ~]$ lsnrctl status listener_dg

...... (omitted)

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.122)(PORT=1522)))

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.123)(PORT=1522)))

Services Summary...

Service "cdb" has 1 instance(s).

  Instance "cdb2", status READY, has 1 handler(s) for this service...

...... (omitted)

[grid@11grac2 ~]$

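As shown above, after the cluster restart CRS also registered the second local listener (LISTENER_DG) dynamically by writing its address into the local_listener parameter. LISTENER_DG can now serve connections normally.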
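The check described above, looking at local_listener for an address on a port that clients cannot reach, can be sketched in a few lines. This is only an illustrative helper, not an Oracle tool: the function names parse_local_listener and unexpected_ports are hypothetical, and the sample value is the instance 2 setting shown after the restart, joined onto one line.

```python
import re

# Matches each (ADDRESS=...(HOST=h)...(PORT=p)) entry; case and spacing are tolerated.
ADDRESS_RE = re.compile(
    r"\(\s*ADDRESS\s*=.*?\(\s*HOST\s*=\s*([^)\s]+)\s*\).*?\(\s*PORT\s*=\s*(\d+)\s*\)\s*\)",
    re.IGNORECASE | re.DOTALL,
)


def parse_local_listener(value):
    """Return the (host, port) pairs found in a local_listener value."""
    return [(host, int(port)) for host, port in ADDRESS_RE.findall(value)]


def unexpected_ports(value, allowed_ports=(1521,)):
    """Return the (host, port) pairs whose port is not in allowed_ports."""
    return [(h, p) for h, p in parse_local_listener(value) if p not in allowed_ports]


# local_listener on instance 2 after the CRS restart (from the output above).
after_restart = (
    "(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.56.123)(PORT=1521)), "
    "(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.56.123)(PORT=1522))"
)

print(parse_local_listener(after_restart))
print(unexpected_ports(after_restart))
```

A non-empty result from unexpected_ports means the instance advertises an address that the business clients may be redirected to, which is the precondition for the fault in this case.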

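5. Reproducing the fault. To simulate it, I enabled the firewall on the client host and blocked communication between the client host and port 1522 on the database hosts. From then on, connections from the client through the SCAN address fail intermittently.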
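To confirm from the client side which listener ports are actually reachable, a plain TCP probe is enough. The sketch below uses only the Python standard library; port_reachable and probe are hypothetical helper names, and the two-second timeout is an arbitrary assumption. It tests TCP reachability only and says nothing about whether a service is registered with the listener.

```python
import socket


def port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections as well as timeouts caused by dropped packets.
        return False


def probe(host, ports=(1521, 1522), timeout=2.0):
    """Map each port to its reachability from the machine running this code."""
    return {port: port_reachable(host, port, timeout) for port in ports}
```

Calling probe("192.168.56.123") on the client host in the test environment would be expected to report port 1522 as unreachable while the firewall rule is in place.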

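When the SCAN listener redirects a connection request to a local listener, local_listener contains both port 1521 (LISTENER) and port 1522 (LISTENER_DG), so requests are distributed between the two local listeners according to some algorithm. A request sent to port 1521 on LISTENER connects successfully; a request sent to port 1522 on LISTENER_DG fails because the security policy does not open port 1522. This is why the business system sometimes connects and sometimes fails.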
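The intermittent behaviour can be illustrated with a toy simulation. The article does not say which algorithm distributes requests between the two addresses, so the sketch below simply assumes a uniform random choice; simulate_connections is a hypothetical name and the model ignores load balancing, retries and everything else a real listener does.

```python
import random


def simulate_connections(local_addresses, open_ports, attempts, seed=0):
    """Count successes and failures when each redirect picks one address.

    ASSUMPTION: a uniform random choice stands in for the listener's real,
    unspecified distribution algorithm.
    """
    rng = random.Random(seed)
    ok = failed = 0
    for _ in range(attempts):
        _host, port = rng.choice(local_addresses)
        if port in open_ports:  # the firewall lets this redirect through
            ok += 1
        else:                   # redirected to a blocked port
            failed += 1
    return ok, failed


node2 = [("192.168.56.123", 1521), ("192.168.56.123", 1522)]

# Both addresses advertised in local_listener, only 1521 open to clients.
print(simulate_connections(node2, open_ports={1521}, attempts=1000))

# Only the 1521 address advertised: every redirect lands on an open port.
print(simulate_connections(node2[:1], open_ports={1521}, attempts=1000))
```

Under this assumption the first run shows a mix of successes and failures, while the second never fails, which mirrors the difference between the faulty and the repaired configuration.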

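6. To illustrate the fault further, I removed the client firewall rule that blocked port 1522, connected from the client again, and continued the packet capture analysis.

With the rule removed, the client connects to the database normally. The packet capture shows the client first contacting port 1521 on the SCAN listener (192.168.56.124), after which the SCAN listener redirects the connection request to port 1522 on the local listener LISTENER_DG.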

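3. Solutions

The reproduction above makes the cause of the fault clear. Now let's look at the corresponding solutions.

Solution 1:

This is the approach friend A used: remove the port 1522 entry from local_listener, and modify the listener.ora file to change the second listener from dynamic registration to static registration.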

SQL> alter system set local_listener='(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.56.121)(PORT=1521))' scope=both sid='cdb1';

System altered.

SQL> alter system set local_listener='(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.56.123)(PORT=1521))' scope=both sid='cdb2';

System altered.

SQL>

Add the following to the listener.ora file, changing SID_NAME to match the instance on each node:

SID_LIST_LISTENER_DG =

(SID_LIST =

  (SID_DESC =

    (GLOBAL_DBNAME = cdb)

    (ORACLE_HOME = /u01/app/oracle/product/11.2.0.4/dbhome_1)

    (SID_NAME = cdb1)

  )

)
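Because only SID_NAME differs between the nodes, the stanza can be generated rather than edited by hand. The following is a small templating sketch, not an Oracle utility: static_sid_list is a hypothetical name, and the node-to-SID mapping (11grac1 runs cdb1, 11grac2 runs cdb2) is taken from the test environment above.

```python
def static_sid_list(listener_name, global_dbname, oracle_home, sid_name):
    """Render a SID_LIST_<listener> stanza for listener.ora as text."""
    return (
        f"SID_LIST_{listener_name} =\n"
        "(SID_LIST =\n"
        "  (SID_DESC =\n"
        f"    (GLOBAL_DBNAME = {global_dbname})\n"
        f"    (ORACLE_HOME = {oracle_home})\n"
        f"    (SID_NAME = {sid_name})\n"
        "  )\n"
        ")\n"
    )


ORACLE_HOME = "/u01/app/oracle/product/11.2.0.4/dbhome_1"

# One stanza per node; only SID_NAME changes.
for node, sid in (("11grac1", "cdb1"), ("11grac2", "cdb2")):
    print(f"# listener.ora on {node}")
    print(static_sid_list("LISTENER_DG", "cdb", ORACLE_HOME, sid))
```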

[grid@11grac2 admin]$ lsnrctl status listener_dg

...... (omitted)

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.122)(PORT=1522)))

  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.56.123)(PORT=1522)))

Services Summary...

Service "cdb" has 1 instance(s).

  Instance "cdb2", status UNKNOWN, has 1 handler(s) for this service...

The command completed successfully

[grid@11grac2 admin]$

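With the second listener (LISTENER_DG) statically registered, I captured network packets while the client connected to the database through the SCAN address.

The packet capture shows the client first contacting port 1521 on the SCAN listener (192.168.56.124). The SCAN listener then redirects every connection request to port 1521 on the local listener LISTENER, and never again to port 1522 on LISTENER_DG.

Solution 2:

Use the LISTENER_NETWORKS parameter to separate the networks, as follows:

(1). Add the following entries to tnsnames.ora on the database hosts: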

CDB1_LOCAL_NET1 =(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.56.121 )(PORT = 1521)))

CDB2_LOCAL_NET1 =(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.56.123 )(PORT = 1521)))

CDB1_LOCAL_NET2 =(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.56.121 )(PORT = 1522)))

CDB2_LOCAL_NET2 =(DESCRIPTION =(ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.56.123 )(PORT = 1522)))  

CDB_REMOTE_NET2 =(DESCRIPTION_LIST =(DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.56.121 ) (PORT = 1522)))(DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.56.123 )(PORT = 1522))))

(2). Clear local_listener and remote_listener:

alter system set local_listener='' scope=both sid='*';

alter system set remote_listener='' scope=both sid='*';

(3). Set the LISTENER_NETWORKS parameter:

alter system set LISTENER_NETWORKS='((NAME=network1)(LOCAL_LISTENER=CDB1_LOCAL_NET1)(REMOTE_LISTENER=rac11g-scan:1521))','((NAME=network2)(LOCAL_LISTENER=CDB1_LOCAL_NET2)(REMOTE_LISTENER=CDB_REMOTE_NET2))' scope=both sid='cdb1';

alter system set LISTENER_NETWORKS='((NAME=network1)(LOCAL_LISTENER=CDB2_LOCAL_NET1)(REMOTE_LISTENER=rac11g-scan:1521))','((NAME=network2)(LOCAL_LISTENER=CDB2_LOCAL_NET2)(REMOTE_LISTENER=CDB_REMOTE_NET2))' scope=both sid='cdb2';

SQL> show parameter listener

NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

listener_networks                  string      ((NAME=network1)(LOCAL_LISTENE

                                                           R=CDB1_LOCAL_NET1)(REMOTE_LIST

                                                           ENER=rac11g-scan:1521)), ((NAM

                                                           E=network2)(LOCAL_LISTENER=CDB

                                                           1_LOCAL_NET2)(REMOTE_LISTENER=

                                                           CDB_REMOTE_NET2))

local_listener                         string

remote_listener                      string

SQL>

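Before LISTENER_NETWORKS was set, the local listener LISTENER_DG listened on port 1522 but, with no static registration configured, could not register any service. Once LISTENER_NETWORKS is set, LISTENER_DG can register the database services dynamically.

At this point, when the business system reaches the database through network1, connection requests are only ever assigned to port 1521.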
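The two ALTER SYSTEM statements in step (3) differ only in the instance name and the matching tnsnames.ora aliases, so they can be built from one template. The sketch below is plain string assembly under that assumption; listener_networks_sql is a hypothetical helper, and the alias names are the ones defined in step (1).

```python
def listener_networks_sql(sid, networks):
    """Build an ALTER SYSTEM statement that sets LISTENER_NETWORKS for one instance.

    networks is a list of (name, local_listener, remote_listener) tuples.
    """
    entries = [
        f"'((NAME={name})(LOCAL_LISTENER={local})(REMOTE_LISTENER={remote}))'"
        for name, local, remote in networks
    ]
    return (
        f"alter system set LISTENER_NETWORKS={','.join(entries)} "
        f"scope=both sid='{sid}';"
    )


for n in (1, 2):
    print(listener_networks_sql(
        f"cdb{n}",
        [
            # network1: business traffic via the SCAN listener on port 1521
            ("network1", f"CDB{n}_LOCAL_NET1", "rac11g-scan:1521"),
            # network2: DataGuard traffic via the LISTENER_DG addresses on port 1522
            ("network2", f"CDB{n}_LOCAL_NET2", "CDB_REMOTE_NET2"),
        ],
    ))
```

Keeping each network's local and remote listeners in one tuple makes the separation explicit: an instance registers the port 1522 address only with the listeners of network2, so the SCAN listener never learns about it.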


Origin blog.csdn.net/royjj/article/details/130701564