[Repost][Case Study] Unstable TPS During Stress Testing

1. Problem description

During pre-go-live stress testing of a business-critical system at a new customer, the application could not reach the required concurrency and response-time targets. The TPS curve during the stress test was very unstable, as shown below:

 

[Screenshot: unstable TPS curve during the stress test (QQ picture 20170411094541.png)]

 

 

2. Analysis

From what we know about Oracle redo logging:

There is only one LGWR process per Oracle instance. Before any session's commit can complete, the session must ask LGWR to write the change vectors it generated in the log buffer to the online redo log on disk.

When many sessions ask LGWR to write at the same time, they queue up.

In a highly concurrent OLTP system, the single LGWR process can therefore become a major bottleneck, especially when I/O to the online redo logs performs poorly.

So the first step is to check the state of the LGWR process.
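The check can be sketched with a query along the following lines (a sketch; gv$session is the standard RAC-wide session view, and filtering on the program column for LGWR is an assumption about how the background process is named on this platform):

```sql
-- Observe the LGWR background process on every RAC node: its current
-- wait event, the wait sequence number, and how long the current wait
-- has lasted (in seconds).
SELECT inst_id, sid, program, event, seq#, state, seconds_in_wait
FROM   gv$session
WHERE  program LIKE '%LGWR%';
```

A healthy LGWR completes each log file parallel write in a few milliseconds, so seconds_in_wait should stay at or near 0 between samples.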

The LGWR processes on the two RAC nodes were observed through gv$session; the results are shown below:

 

 

[Screenshot: gv$session output for the LGWR processes on both nodes (QQ picture 20170411094640.png)]

 

 

 

From the output we can see:

- Of the two RAC (database cluster) nodes, only one shows the log file parallel write wait. This event means the LGWR process is waiting for a write to the online redo log on disk to complete.

- While in the WAITING state, node 1's log file parallel write wait with seq# 35693 shows a seconds_in_wait of 21 seconds. Simply put, a single LGWR write I/O is taking 21 seconds!

This means that during the stress test, every session that commits must wait for LGWR to finish this I/O before LGWR can continue flushing change vectors from the log buffer. Seen from the TPS curve, throughput is therefore unstable, with sharp drops.
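To quantify how bad the redo write latency actually is, the wait-time distribution for the event can also be inspected (a sketch using the standard gv$event_histogram view):

```sql
-- Latency distribution of 'log file parallel write' per RAC node.
-- On healthy storage, the buckets above a few milliseconds are empty.
SELECT inst_id, wait_time_milli, wait_count
FROM   gv$event_histogram
WHERE  event = 'log file parallel write'
ORDER  BY inst_id, wait_time_milli;
```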

At this point we can be certain that the I/O subsystem has a problem.

The whole I/O path therefore had to be investigated: fibre cables, SAN switches, and the storage array, together with their performance.

Considering that the storage team on the customer side might not accept "the database I/O is slow" as evidence, and in order to get them to investigate more seriously, we remotely asked the customer to run the multipathing software's status command to check the I/O paths; the results are shown below:

 

[Screenshot: multipathing software status output (QQ picture 20170411094841.png)]

 

 

 

Node 1 shows clear I/O errors, and the error count keeps growing!

Node 2 was checked next: no I/O errors at all!

This matches gv$session exactly: only one node's LGWR was stuck on log file parallel write.
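The path check itself can be scripted so it is repeatable. The sketch below assumes Linux device-mapper-multipath, whose `multipath -ll` output marks the state of each path; the sample output here is hypothetical, for illustration only:

```shell
# Hypothetical excerpt of `multipath -ll` output saved from node 1;
# in the real case the command would be run live on each RAC node.
cat <<'EOF' > /tmp/node1_paths.txt
  |- 3:0:0:1 sdb 8:16  failed faulty running
  `- 4:0:0:1 sdc 8:32  active ready  running
EOF

# Flag any path that is not healthy ("active ready"); rerunning this
# and seeing more faulty paths or growing error counts points at the
# fibre cable, HBA, or SAN switch on that path.
grep -E 'failed|faulty' /tmp/node1_paths.txt
```

Comparing the two nodes with the same script makes the asymmetry obvious at a glance.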

3. Root cause

Faced with hard evidence, the customer's storage team stopped pushing back and began investigating in earnest, component by component. The problem was finally resolved by replacing a faulty fibre optic cable. The results below are from the stress test rerun after the cable was replaced.

4. Resolution

The TPS curve of the stress test changed from its original fluctuating shape:

 

[Screenshot: TPS curve before the fix (QQ picture 20170411094933.png)]

 

 

to a smooth, steady profile:

 

[Screenshot: TPS curve after the fix (QQ picture 20170411095040.png)]


Source: www.cnblogs.com/1737623253zhang/p/11576519.html