log file switch (checkpoint incomplete) - easily misdiagnosed event

This switched https://blogs.oracle.com/database4cn/log-file-switch-checkpoint-incomplete-%e5%ae%b9%e6%98%93%e8%a2%ab%e8%af%af%e8 % af% 8a% e7% 9a% 84event
 
A first look at the following AWR, 12 minute sampling, DB Time 105 minutes.
DB Name DB Id Instance Inst num Startup Time Release RAC
R11204 2114874159 R11204 1 23-Oct-17 10:10 11.2.0.4.0 NO
 
Host Name Platform CPUs Cores Sockets Memory (GB)
nascds18 Linux x86 64-bit 2 2 1 11.64
 
Snap Id Snap Time Sessions Cursors/Session
Begin Snap: 3 23-Oct-17 10:55:46 37 2.5
End Snap: 4 23-Oct-17 11:08:27 53 2.3
Elapsed:   12.67 (mins)    
DB Time:   105.90 (mins)
 
Top event discovery buffer busy waits and log file switch (checkpoint incomplete) takes up almost all of the DB Time
Top 10 Foreground Events by Total Wait Time
 
Event Waits Total Wait Time (sec) Wait Avg(ms) % DB time Wait Class
buffer busy waits 11 3310.6 300960 52.1 Concurrency
log file switch (checkpoint incomplete) 10 3034.8 303479 47.8 Configuration
DB CPU   5.5   .1  
log file sync 28 2.3 82 .0 Commit
log buffer space 24 .8 32 .0 Configuration

 

By ASH difficult to find buffer busy waits are log file switch (checkpoint incomplete) obstruction, and log file switch (checkpoint incomplete) has been blocked LGWR read the control file sequential.
2017-10-23 11:05:31.563 1 37 perl@nascds18 (TNS V1-V3) buffer busy waits WAITING 1 150
2017-10-23 11:05:31.563 1 150 sqlplus@nascds18 (TNS V1-V3) buffer busy waits WAITING 1 141
2017-10-23 11:05:31.563 1 141 OMS log file switch (checkpoint incomplete) WAITING 1 130
2017-10-23 11:05:31.563 1 130 oracle@nascds18 (LGWR) control file sequential read WAITING NO HOLDER

 

Then look back at AWR's iostat, LGWR soon found a large number of read and are reading the control file, read in 12 minutes up to 1.5G.
IOStat by Function summary
 
Function Name Reads: Data
LGWR 1.5G
Others 210M
DBWR 0M
Buffer Cache Reads 10M
Direct Writes 0M
TOTAL: 1.7G
 
IOStat by Filetype summary
 
Filetype Name Reads: Data
Control File 1.5G
Log File 185M
Archive Log 0M
Data File 10M
Temp File 0M
TOTAL: 1.7G

 

ASH combination of LGWR is the ultimate holder, and the fact that LGWR wait control file sequential read, we may soon be LGWR as a key suspect, if so, it went wide.
 
This time may wish to stop and think about what is the log file switch (checkpoint incomplete)?
 
log file switch (checkpoint incomplete) refers to the need to redo down when a set of redo Group handover, the group found that the log is active, that is the group number of the log buffer cache corresponding dirty block has not write the data file, it is necessary to wait for the dirty blocks is finished, it may be multiplexed with the next set of redo group.
 
Then think again, what process is responsible for the dirty block is written to the data file? The answer is DBWn, therefore we need to focus on DBWn. Analyze OSWatcher state DBW0 is not difficult to find D, man about ps command: D refers Uninterruptible sleep (usually IO), that is to say DBW0 hang on I / O, and this is the fundamental cause of the problem.
 
So why are so many LGWR performs control file sequential read? The answer is DBWn problems cause LGWR constantly poll the control file for the redo state (look for the handover success) lead to a lot of control file sequential read
 
The following experiments to demonstrate this phenomenon:
 
Test on 12.1.0.2
 
Session 1: Create a table, and update it allowed to generate redo
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
 
SQL> select sid from v$mystat where rownum=1;
 
      SID
----------
       11   <<<< session 1的 sid 为11
 
SQL> create table t as select * from dba_objects;
 
Table created.
 
SQL> update t set object_name=object_name;
 
93841 rows updated.

 

Session 2: a view of the switching state redo
SQL> set line 200 pages 1000
SQL> select group#,thread#,status from v$log;
 
   GROUP#    THREAD# STATUS
---------- ---------- ----------------
        1          1 INACTIVE
        2          1 INACTIVE
        3          1 CURRENT
 
SQL> alter system switch logfile;
 
System altered.
 
SQL> alter system switch logfile;
 
System altered.
 
SQL> select group#,thread#,status from v$log;
 
   GROUP#    THREAD# STATUS
---------- ---------- ----------------
        1          1 ACTIVE     <<<<<< 让下一组是active的
        2          1 ACTIVE
        3          1 CURRENT

 

Session 3: determining the DBWn and LGWR spid, and by the DBW0 suspend oradebug allowed to stay does not work
SQL> set line 200 pages 1000
SQL> select spid,program from v$process where program like '%DBW%' or program like '%LG%';
 
SPID                     PROGRAM
------------------------ ----------------------------------------------------------------
5768                     ORACLE.EXE (DBW0)
7248                     ORACLE.EXE (LGWR)
6384                     ORACLE.EXE (LG00)
6308                     ORACLE.EXE (LG01)
 
SQL> oradebug setospid 5768
Oracle pid: 11, Windows thread id: 5768, image: ORACLE.EXE (DBW0)
SQL> oradebug suspend
Statement processed.

 

Next update performed several times again in Session 1 allowed automatic switching log:
Session 1:
SQL> update t set object_name=object_name;
 
93841 rows updated.
 
SQL> /
 
93841 rows updated.
 
SQL> /
 
93841 rows updated.
 
SQL> /   ---- 在这里hang住了

 

Session 4: watchful waiting event
SQL> set line 200 pages 1000
SQL> col program for a15
SQL> col event for a40
SQL> select sid,serial#,program,event,state from v$session where program like '%sqlplus%';
 
      SID    SERIAL# PROGRAM         EVENT                                    STATE
---------- ---------- --------------- ---------------------------------------- -------------------
       10      33682 sqlplus.exe     SQL*Net message from client              WAITING
       11      48189 sqlplus.exe     log file switch (checkpoint incomplete)  WAITING
      129      25471 sqlplus.exe     SQL*Net message to client                WAITED SHORT TIME
      130      64963 sqlplus.exe     SQL*Net message from client              WAITING

 

Next we do 10046 trace of LGWR, observe the waiting: Visible control file sequential read repeatedly
SQL> oradebug setospid 7248
Oracle pid: 12, Windows thread id: 7248, image: ORACLE.EXE (LGWR)
SQL> oradebug event 10046 trace name context forever, level 8
Statement processed.
SQL> oradebug tracefile_name
D:\ORACLE\diag\rdbms\r12102\r12102\trace\r12102_lgwr_7248.trc
 
tail -f D:\ORACLE\diag\rdbms\r12102\r12102\trace\r12102_lgwr_7248.trc
 
WAIT #0: nam='LGWR all worker groups' ela= 72 p1=0 p2=0 p3=0 obj#=-1 tim=25178622234
WAIT #0: nam='control file sequential read' ela= 407 file#=0 block#=1 blocks=1 obj#=-1 tim=25178622880
WAIT #0: nam='control file sequential read' ela= 262 file#=1 block#=1 blocks=1 obj#=-1 tim=25178623344
WAIT #0: nam='control file sequential read' ela= 717 file#=0 block#=15 blocks=1 obj#=-1 tim=25178624315
WAIT #0: nam='control file sequential read' ela= 1774 file#=0 block#=17 blocks=1 obj#=-1 tim=25178626427
WAIT #0: nam='control file sequential read' ela= 311 file#=0 block#=19 blocks=1 obj#=-1 tim=25178627527
WAIT #0: nam='control file sequential read' ela= 269 file#=0 block#=284 blocks=1 obj#=-1 tim=25178627983
WAIT #0: nam='control file sequential read' ela= 238 file#=0 block#=22 blocks=1 obj#=-1 tim=25178628363
WAIT #0: nam='LGWR all worker groups' ela= 51 p1=0 p2=0 p3=0 obj#=-1 tim=25178628590
WAIT #0: nam='control file sequential read' ela= 503 file#=0 block#=1 blocks=1 obj#=-1 tim=25178629320
WAIT #0: nam='control file sequential read' ela= 322 file#=1 block#=1 blocks=1 obj#=-1 tim=25178630389
WAIT #0: nam='control file sequential read' ela= 276 file#=0 block#=15 blocks=1 obj#=-1 tim=25178630864
WAIT #0: nam='control file sequential read' ela= 253 file#=0 block#=17 blocks=1 obj#=-1 tim=25178631286
WAIT #0: nam='control file sequential read' ela= 250 file#=0 block#=19 blocks=1 obj#=-1 tim=25178631696
WAIT #0: nam='control file sequential read' ela= 658 file#=0 block#=284 blocks=1 obj#=-1 tim=25178632935
WAIT #0: nam='control file sequential read' ela= 303 file#=0 block#=22 blocks=1 obj#=-1 tim=25178633812
......

 

Finally, we will DBWn recovery, the end of this infinite control file sequential read wait.
Back to Session 3, resume DBW0:
SQL> oradebug resume
Statement processed. 
Once you understand the principle, this problem would have been difficult to diagnose, but ASH and AWR DBA information will be wrong to think the problem LGWR wait control file sequential read due.

Guess you like

Origin www.cnblogs.com/zylong-sys/p/12015975.html