Performance tuning case study: high log file sync waits

Fault conditions

The AWR reports the following:

 

 

 

 

Even after most of the business workload had been stopped, log file sync waits remained very high.

Comparing the AWR reports for the same time period yesterday and today shows that the wait time is still very high even though the business volume is very small.

Diagnostic process

When log file sync waits are high, the first thing to check is whether the system has an I/O problem. The operating system logs contained no related errors, and an I/O test showed that I/O was in a normal state. A detailed review of the AWR report also showed that database read and write I/O were healthy. Comparing yesterday's AWR report with today's, I/O read and write performance differed very little. Because most of the business was suspended during the period collected today, I/O was actually better than yesterday, yet the wait time for log switching increased by 8 seconds instead.
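The day-over-day comparison described above can also be approximated directly from the AWR history views instead of reading two reports side by side. The following is a minimal sketch, assuming the Diagnostics Pack is licensed and that AWR retention covers both days; the two-day window and the column aliases are illustrative choices, not part of the original analysis.

```sql
-- Average wait per AWR snapshot interval for the two redo-related events.
-- DBA_HIST_SYSTEM_EVENT stores cumulative values, so each row is turned
-- into a per-interval delta with LAG().
SELECT snap_id,
       end_interval_time,
       event_name,
       waits_delta,
       ROUND(time_delta_us / NULLIF(waits_delta, 0) / 1000, 2) AS avg_wait_ms
FROM  (SELECT s.snap_id,
              s.end_interval_time,
              e.event_name,
              e.total_waits
                - LAG(e.total_waits) OVER (
                    PARTITION BY e.dbid, e.instance_number, e.event_name
                    ORDER BY s.snap_id) AS waits_delta,
              e.time_waited_micro
                - LAG(e.time_waited_micro) OVER (
                    PARTITION BY e.dbid, e.instance_number, e.event_name
                    ORDER BY s.snap_id) AS time_delta_us
       FROM   dba_hist_system_event e
       JOIN   dba_hist_snapshot s
              ON  s.snap_id         = e.snap_id
              AND s.dbid            = e.dbid
              AND s.instance_number = e.instance_number
       WHERE  e.event_name IN ('log file sync', 'log file parallel write')
       AND    s.end_interval_time > SYSTIMESTAMP - INTERVAL '2' DAY)
WHERE  waits_delta > 0   -- drops the first snapshot and intervals spanning a restart
ORDER  BY event_name, snap_id;
```

If the average for log file parallel write stays flat between the two days while the average for log file sync rises, that matches the pattern described in this case.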

The commit and log file sync flow is described in the "Log file sync, LGWR" diagnostic guide written by Tanel Poder:

 

From the figure above, we can clearly see the whole commit process.

1. The user issues a commit.

2. The foreground process (that is, the server process) posts a message to the LGWR process, telling it to write the redo buffer.

3. On receiving the message, LGWR calls the operating system to perform the physical write. While the physical write is in progress, LGWR waits on log file parallel write.

4. When LGWR completes the write operation, it posts a message back to the foreground process (the server process), telling it that the write has finished and the commit can complete.

5. The user's commit operation completes.

Given the flow chart above, and given that the AWR report shows short wait times for log file parallel write, user I/O, and system I/O, we can conclude that the problem does not lie in I/O performance. The problem should therefore be in stage 2 and stage 5.
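One way to see the gap between the physical write and the overall commit wait is to compare the latency distribution of the two events. The query below is a sketch using V$EVENT_HISTOGRAM; its counts are cumulative since instance startup, so it shows the long-run shape rather than only the problem window.

```sql
-- Latency distribution of the commit wait versus LGWR's physical write.
-- WAIT_TIME_MILLI is the upper bound of each bucket in milliseconds.
SELECT event,
       wait_time_milli,
       wait_count,
       ROUND(100 * RATIO_TO_REPORT(wait_count)
                     OVER (PARTITION BY event), 1) AS pct_of_waits
FROM   v$event_histogram
WHERE  event IN ('log file sync', 'log file parallel write')
ORDER  BY event, wait_time_milli;
```

If most log file parallel write waits fall in the lowest buckets while log file sync waits cluster in much higher ones, the extra time is being spent outside the physical write, which is consistent with the reasoning above.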

The following My Oracle Support (MOS) articles are relevant:

Troubleshooting: 'Log file sync' Waits (Doc ID 1376916.1)

Adaptive Log File Sync Optimization (Doc ID 1541136.1)

Adaptive Switching Between Log Write Methods can Cause 'log file sync' Waits (Doc ID 1462942.1)

These articles show that Oracle 11g introduced a new log synchronization method, called adaptive log file sync, which is enabled by default starting with Oracle 11.2.0.3. The articles describe it as follows:

Initially the LGWR uses post/wait and according to an internal algorithm evaluates whether polling is better. Under high system load polling may perform better because the post/wait implementation typically does not scale well. If the system load is low, then post/wait performs well and provides better response times than polling. Oracle relies on internal statistics to determine which method should be used.  Because switching between post/wait and polling incurs an overhead, safe guards are in place in order to ensure that switches do not occur too frequently.

Following the description in the document, I checked the system load and found that the business system had been under high load at around 14:00 today, with 136 log switches, after which the business returned to normal.
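The number of log switches per hour can be read from V$LOG_HISTORY. The following sketch assumes the relevant period is the current day; the alias names are illustrative.

```sql
-- Log switches per hour since midnight, one row per hour.
SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS switch_hour,
       COUNT(*)                               AS log_switches
FROM   v$log_history
WHERE  first_time >= TRUNC(SYSDATE)
GROUP  BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
ORDER  BY switch_hour;
```

An hour with a count far above its neighbors marks the load spike that may have triggered the switch to polling.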

 

Checking the current system shows that log file sync is using the polling method:

SQL> select name,value from v$sysstat where name in ('redo synch poll writes','redo synch polls');

redo synch poll writes   6248

redo synch polls        8843
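Because V$SYSSTAT values are cumulative since instance startup, nonzero poll counters alone only show that polling has been used at some point. To check whether polling is still in use right now, the counter can be sampled twice. The block below is a minimal sketch, assuming the session has EXECUTE privilege on DBMS_LOCK; the 10-second interval is an arbitrary choice.

```sql
SET SERVEROUTPUT ON

DECLARE
  v_before NUMBER;
  v_after  NUMBER;
BEGIN
  -- First sample of the cumulative counter.
  SELECT value INTO v_before
  FROM   v$sysstat
  WHERE  name = 'redo synch poll writes';

  DBMS_LOCK.SLEEP(10);  -- wait 10 seconds between samples

  -- Second sample; the difference is the activity during the interval.
  SELECT value INTO v_after
  FROM   v$sysstat
  WHERE  name = 'redo synch poll writes';

  DBMS_OUTPUT.PUT_LINE('redo synch poll writes in last 10 s: '
                       || (v_after - v_before));
END;
/
```

A difference that keeps growing while commits are running suggests that sessions are still polling. A difference of zero despite commit activity suggests that post/wait is in use.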

Failure Cause Analysis

Because the business load became very high at around 14:00, Oracle switched the log file sync method to polling in order to sustain performance. Under normal circumstances, once the business load drops, it should switch back to the original post/wait method. This time no switch back occurred and polling remained in use. We concluded that a bug prevented the switch back, which led to the poor performance. (Articles found online state that the polling interval is 10 ms and the post/wait interval is 1 to 2 ms; I could not find a MOS article that documents these figures.)

Related bugs:

 

On version 11.2.0.3, applying DATABASE PATCH SET UPDATE 11.2.0.3.14 (INCLUDES CPUAPR2015) resolves this bug.
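Before planning the patch, it helps to confirm which PSU level the database is currently on. A sketch, assuming earlier PSUs were registered in the database through the usual post-install step so that they appear in DBA_REGISTRY_HISTORY:

```sql
-- Patch bundles recorded in the database, oldest first.
SELECT action_time,
       action,
       version,
       id,
       bundle_series,
       comments
FROM   dba_registry_history
ORDER  BY action_time;
```

The binary patch level of the Oracle home can be cross-checked with `opatch lsinventory` on the server.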

Solution

Workaround

Disable adaptive log file sync by setting "_use_adaptive_log_file_sync"=false

ALTER SYSTEM SET "_use_adaptive_log_file_sync"=FALSE ;
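To verify the setting before and after the change, the hidden parameter can be read from the X$ fixed tables. This is a sketch that assumes a connection as SYS, since these tables are not visible to ordinary users. The SCOPE = BOTH variant additionally assumes the instance was started with an spfile. Hidden parameters are normally changed only under guidance from Oracle Support.

```sql
-- Current value of the hidden parameter (run as SYS).
SELECT i.ksppinm  AS parameter_name,
       v.ksppstvl AS current_value,
       v.ksppstdf AS is_default
FROM   x$ksppi  i
JOIN   x$ksppcv v ON v.indx = i.indx
WHERE  i.ksppinm = '_use_adaptive_log_file_sync';

-- Same change as above, with the scope stated explicitly so that it
-- applies to the running instance and persists across restarts.
ALTER SYSTEM SET "_use_adaptive_log_file_sync" = FALSE SCOPE = BOTH;
```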

Alternatively, restarting the instance also resolves the problem.

Origin blog.csdn.net/m0_37723088/article/details/130998306