Fix high "latch: cache buffers chains" waits

This is the first part in a series of articles entitled "The right tool for the right job". In each article, I will use tools and ideas made available to us by oracle gurus like Tanel Poder, Jonathan Lewis, Tom Kyte, Kerry Osborne and many others. I will try to show you how to use those available free tools for troubleshooting real life problem I've encountered at client sites. You can use the scripts "as-is" (note that you should always test any script before using it on production systems) or modify them to suit your needs or even write your own based on ideas found on original scripts.

Some background first, database is 10.2.0.4 running in a SUN M9000 dynamic domain with 128Go of memory and 32 CPU (2520 MHz) allocated using fair share scheduler (FSS). We are experiencing high "latch: cache buffers chains" (CBC) contention issue. The problem occurs intermittently (several times a day) and also resolves by itself. Here is an overview of the issue with Quest Perfomance Analysis (as the client pays for it, let's use it to show a pretty graph ^_^) :

AAS during latch contention

CBC latches are represented in pink color, I did the whole analysis during 10:00 and 10:15 AM, when the problem was currently happening.

What is a latch?
Latches are simple, low-level serialization mechanisms to protect shared SGA data structures and shared code segments (such as various internal linked list modifications, shared pool memory allocation, library cache object lookups and so on...) from simultaneous session access. They are designed to be very quickly acquired and freed. Latches are very low-level locks, inaccessible to users or applications as they cannot directly acquire nor release them. They are similar in purpose to locks: Latches protect internal memory structures while locks protect data structures.

The database buffer cache is the portion of the SGA that holds copies of data blocks read from datafiles so they can be accessed faster than by reading them off disks.

We can usually get more information about wait events by checking P1, P2 and P3. We will then look up on V$EVENT_NAME what P1, P2 and P3 mean for "latch: cache buffers chains".

SQL> select PARAMETER1 P1, PARAMETER2 P2, PARAMETER3 P3
  2  from v$event_name
  3  where name = 'latch: cache buffers chains';
 
P1         P2         P3
---------- ---------- ----------
address    number     tries

From the above query, P1 is the address of the latch for the cbc latch wait. Now, we can query ASH by grouping the CBC latch waits by the address and find out what address is experiencing the most contention (=waits).

SQL> select * from ( 2 select 3 event, 4 trim(to_char(p1, 'XXXXXXXXXXXXXXXX')) latch_addr, 5 trim(round(ratio_to_report(count(*)) over () * 100, 1))||'%' pct, 6 count(*) 7 from 8 v$active_session_history 9 where 10 event = 'latch: cache buffers chains' 11 and session_state = 'WAITING' 12 group by event,p1 13 order by count(*) desc 14 ) 15 where rownum <= 10; EVENT LATCH_ADDR PCT COUNT(*) -------------------------------------------------- ----------------- -------- ---------- latch: cache buffers chains 967757968 78.8% 178 latch: cache buffers chains 964F892C0 2.7% 6 latch: cache buffers chains CE2984B50 2.2% 5 latch: cache buffers chains 9676FC808 1.3% 3 latch: cache buffers chains 963019308 1.3% 3 latch: cache buffers chains 963DE63F0 .9% 2 latch: cache buffers chains 9655A1E38 .9% 2 latch: cache buffers chains 966F65E28 .9% 2 latch: cache buffers chains 963276CE8 .9% 2 latch: cache buffers chains CE1AD8E38 .9% 2 10 rows selected. SQL>

Looking at the highlighted line, 78.8% of the waits on "latch: cache buffers chains" involves child latch at address 967757968.

Now that we know the actual CBC child latch address having the most waits, we can use Tanel Poder's latchprofx script (http://files.e2sn.com/scripts/latchprofx.sql) to find the root cause of the contention. We will monitor the holders (sid and sql_id) of this particular child latch. The column called "object", only available in x$ksuprlat shows information about the object protected by a given latch. For cache buffers chains latches, this column shows the Data Block Address (DBA) of the block that we accessed, causing the latch get.

SQL> @latchprofx sid,name,sqlid,object % 967757968 10000
 
-- LatchProfX 2.00 by Tanel Poder ( http://www.tanelpoder.com )
 
   SID NAME                        SQLID                  OBJECT       Held       Gets  Held %     Held ms Avg hold ms
------ --------------------------- ------------- --------------- ---------- ---------- ------- ----------- -----------
  1258 cache buffers chains        amnf5uaxyn12c         31B7E54        100        100    1.00     442.500       4.425
  3077 cache buffers chains        amnf5uaxyn12c         31B7E54         97         97     .97     429.225       4.425
  3004 cache buffers chains        amnf5uaxyn12c         31B7E54         88         88     .88     389.400       4.425
  3082 cache buffers chains        amnf5uaxyn12c         31B7E54         50         50     .50     221.250       4.425
  2947 cache buffers chains        amnf5uaxyn12c         31B7E54         13         13     .13      57.525       4.425
   967 cache buffers chains        17qcz644sgfzk         31B7E54          4          4     .04      17.700       4.425
  3176 cache buffers chains        17qcz644sgfzk         31B7E54          4          4     .04      17.700       4.425
  3000 cache buffers chains        17qcz644sgfzk         31B7E54          2          2     .02       8.850       4.425
  1165 cache buffers chains        17qcz644sgfzk         31B7E54          2          2     .02       8.850       4.425
  1037 cache buffers chains        17qcz644sgfzk         31B7E54          2          2     .02       8.850       4.425
  2950 cache buffers chains        fb8phc7ms9s5u         31B7E54          2          2     .02       8.850       4.425
  3077 cache buffers chains        17qcz644sgfzk         31B7E54          1          1     .01       4.425       4.425
  2950 cache buffers chains        50p77b0zpx5cx         31B7E54          1          1     .01       4.425       4.425
  2933 cache buffers chains        17qcz644sgfzk         31B7E54          1          1     .01       4.425       4.425
  2633 cache buffers chains        17qcz644sgfzk         31B7E54          1          1     .01       4.425       4.425
  2594 cache buffers chains        985rq8bkb254j         31B7E54          1          1     .01       4.425       4.425
  2557 cache buffers chains        17qcz644sgfzk         31B7E54          1          1     .01       4.425       4.425
  1284 cache buffers chains        985rq8bkb254j         31B7E54          1          1     .01       4.425       4.425
  1240 cache buffers chains        17qcz644sgfzk         31B7E54          1          1     .01       4.425       4.425
  1026 cache buffers chains        17qcz644sgfzk         31B7E54          1          1     .01       4.425       4.425
 
20 rows selected.

The block address accessed is 31B7E54 and the sessions holding the child latch the most were executing the query with sql_id amnf5uaxyn12c.
Next, we will convert that data block address to file#, block# using dbms_utility and try to find to which segment it belongs querying x$bh (bh stands for blocks headers). Once again, Tanel wrote a script for that (dba.sql standing for Data Block Address^^). You will find it in tpt_public_unixmac.tar.gz available on Tanel's new website.

SQL> @dba 31B7E54
 
    RFILE#     BLOCK#
---------- ----------
        12    1801812
 
Press enter to find the segment using V$BH (this may take CPU time), CTRL+C to cancel:
 
STATE      BLOCK_CLASS        OBJECT_TYPE      object                                TCH  MODE_HELD D T P S D FLG_LRUFLG          DQ
---------- ------------------ ---------------- ------------------------------ ---------- ---------- - - - - - -------------- -------
xcur       data block         INDEX            TAU$OWNER.ITRN2TAU                     17          1 N N N N N 2000:8               0
 
Press enter to query what segment resides there using DBA_EXTENTS (this can be IO intensive), CTRL+C to cancel:
^C
 
SQL>

To solve the problem once and for all, we must find the SQL and why is application hitting the block so hard.

We can get sql_fulltext of a given sql_id either from V$SQL or V$SQLAREA or V$SQLSTATS. I personally prefer getting it from V$SQLSTATS because it is faster, more scalable, and has a greater data retention (the statistics may still appear in this view, even after the cursor has been aged out of the shared pool).

I have a little script for that : sqltext.sql

SQL> @sqltext amnf5uaxyn12c
Wrote file /tmp/sqlplus_settings
 
SQL_FULLTEXT
------------------------------------------------------------------------------------------------------------------------------------------------------
SELECT DATEFT, CODAPPOPE, NUMTRAOPE, NUMTRN, MNEFLD, MNEPTF, INIRSPFRT, MNECPAFIN, NUMHDRPDT FROM TTRNTAU T1 WHERE T1.CODAPPOPE = :B2 AND T1.NUMTRAOPE
= :B1 AND T1.DATEFT = (SELECT /*+ORDERED*/ MAX(T2.DATEFT) FROM TTRNTAU T2, THDRPDTTAU HDP WHERE T2.CODAPPOPE = T1.CODAPPOPE AND T2.NUMTRAOPE =
T1.NUMTRAOPE AND T2.NUMHDRPDT = HDP.NUMHDRPDT AND ((HDP.TYPCPAREF != 'PF' AND T2.MNECPATRN = HDP.MNECPAREF) OR (HDP.TYPCPAREF = 'PF' AND T2.MNECPATRN
= :B3 )) AND T2.CATTRN != 'ANNTEC' AND 0 = (SELECT MOD(COUNT(* ),2) FROM TTRNTAU T3 START WITH T3.CODAPPOPE = T2.CODAPPOPE AND T3.NUMTRAOPE =
T2.NUMTRAOPE AND T3.CATTRN = 'ANNTEC' AND T3.NUMTRNANU = T2.NUMTRN CONNECT BY T3.CODAPPOPE = T2.CODAPPOPE AND T3.NUMTRAOPE = T2.NUMTRAOPE AND
T3.CATTRN = 'ANNTEC' AND T3.NUMTRNANU = PRIOR T3.NUMTRN)) AND ((T1.NATTRN = 'CRE' AND NVL(T1.TYPSOUTRN,'XXX') NOT IN ('ANNPAR','ANNTOT')) OR
(T1.NATTRN = 'ANN' AND (NVL(T1.TYPSOUTRN,'XXX') IN ('ANNPAR','ANNTOT') OR NOT EXISTS (SELECT NULL FROM TTRNTAU T3 WHERE T3.CODAPPOPE = T1.CODAPPOPE
AND T3.NUMTRAOPE = T1.NUMTRAOPE AND T3.DATEFT = T1.DATEFT AND T3.NATTRN = 'CRE'))))
SQL>

 

If you have T.O.A.D, one useful feature is SQL formatting tool (don't get me wrong, I'm no big fan of GUI tools, I use SQL*PLUS 99% of the time ;-)

/* Formatted on 2010/04/16 16:17 (Formatter Plus v4.8.8) */
SELECT dateft, codappope, numtraope, numtrn, mnefld, mneptf, inirspfrt,
       mnecpafin, numhdrpdt
  FROM ttrntau t1
 WHERE t1.codappope = :b2
   AND t1.numtraope = :b1
   AND t1.dateft =
          (SELECT /*+ORDERED*/
                  MAX (t2.dateft)
             FROM ttrntau t2, thdrpdttau hdp
            WHERE t2.codappope = t1.codappope
              AND t2.numtraope = t1.numtraope
              AND t2.numhdrpdt = hdp.numhdrpdt
              AND (   (hdp.typcparef != 'PF' AND t2.mnecpatrn = hdp.mnecparef
                      )
                   OR (hdp.typcparef = 'PF' AND t2.mnecpatrn = :b3)
                  )
              AND t2.cattrn != 'ANNTEC'
              AND 0 =
                     (SELECT     MOD (COUNT (*), 2)
                            FROM ttrntau t3
                      START WITH t3.codappope = t2.codappope
                             AND t3.numtraope = t2.numtraope
                             AND t3.cattrn = 'ANNTEC'
                             AND t3.numtrnanu = t2.numtrn
                      CONNECT BY t3.codappope = t2.codappope
                             AND t3.numtraope = t2.numtraope
                             AND t3.cattrn = 'ANNTEC'
                             AND t3.numtrnanu = PRIOR t3.numtrn))
   AND (   (    t1.nattrn = 'CRE'
            AND NVL (t1.typsoutrn, 'XXX') NOT IN ('ANNPAR', 'ANNTOT')
           )
        OR (    t1.nattrn = 'ANN'
            AND (   NVL (t1.typsoutrn, 'XXX') IN ('ANNPAR', 'ANNTOT')
                 OR NOT EXISTS (
                       SELECT NULL
                         FROM ttrntau t3
                        WHERE t3.codappope = t1.codappope
                          AND t3.numtraope = t1.numtraope
                          AND t3.dateft = t1.dateft
                          AND t3.nattrn = 'CRE')
                )
           )
       )

Let's have a look at the execution plan of this query using dbms_xplan.display_cursor:

SQL> @sqlplan amnf5uaxyn12c
Wrote file /tmp/sqlplus_settings
SQL_ID  amnf5uaxyn12c, child number 0
-------------------------------------
SELECT DATEFT, CODAPPOPE, NUMTRAOPE, NUMTRN, MNEFLD, MNEPTF, INIRSPFRT, MNECPAFIN,
NUMHDRPDT FROM TTRNTAU T1 WHERE T1.CODAPPOPE = :B2 AND T1.NUMTRAOPE = :B1 AND T1.DATEFT
= (SELECT /*+ORDERED*/ MAX(T2.DATEFT) FROM TTRNTAU T2, THDRPDTTAU HDP WHERE
T2.CODAPPOPE = T1.CODAPPOPE AND T2.NUMTRAOPE = T1.NUMTRAOPE AND T2.NUMHDRPDT =
HDP.NUMHDRPDT AND ((HDP.TYPCPAREF != 'PF' AND T2.MNECPATRN = HDP.MNECPAREF) OR
(HDP.TYPCPAREF = 'PF' AND T2.MNECPATRN = :B3 )) AND T2.CATTRN != 'ANNTEC' AND 0 =
(SELECT MOD(COUNT(* ),2) FROM TTRNTAU T3 START WITH T3.CODAPPOPE = T2.CODAPPOPE AND
T3.NUMTRAOPE = T2.NUMTRAOPE AND T3.CATTRN = 'ANNTEC' AND T3.NUMTRNANU = T2.NUMTRN
CONNECT BY T3.CODAPPOPE = T2.CODAPPOPE AND T3.NUMTRAOPE = T2.NUMTRAOPE AND T3.CATTRN =
'ANNTEC' AND T3.NUMTRNANU = PRIOR T3.NUMTRN)) AND ((T1.NATTRN = 'CRE' AND
NVL(T1.TYPSOUTRN,'XXX') NOT IN ('ANNPAR','ANNTOT')) OR (T1.NATTRN = 'ANN' AND
(NVL(T1.TYPSOUTRN,'XXX') IN ('ANNPAR','ANNTOT') OR NOT EXISTS (SELECT NULL FROM TTRNTAU
T3 WHERE T3.CODAPPOP
 
Plan hash value: 311289460
--------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name          | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------
|*  1 |  FILTER                            |               |        |       |            |          |       |       |          |
|   2 |   HASH GROUP BY                    |               |      1 |   144 |    17   (6)| 00:00:01 |   773K|   773K|  978K (0)|
|*  3 |    FILTER                          |               |        |       |            |          |       |       |          |
|   4 |     MERGE JOIN CARTESIAN           |               |      1 |   144 |    11   (0)| 00:00:01 |       |       |          |
|   5 |      NESTED LOOPS                  |               |      1 |    75 |     7   (0)| 00:00:01 |       |       |          |
|*  6 |       TABLE ACCESS BY INDEX ROWID  | TTRNTAU       |      1 |    58 |     5   (0)| 00:00:01 |       |       |          |
|*  7 |        INDEX RANGE SCAN            | ITRN2TAU      |      1 |       |     4   (0)| 00:00:01 |       |       |          |
|*  8 |       TABLE ACCESS BY INDEX ROWID  | THDRPDTTAU    |      1 |    17 |     2   (0)| 00:00:01 |       |       |          |
|*  9 |        INDEX UNIQUE SCAN           | PK_THDRPDTTAU |      1 |       |     1   (0)| 00:00:01 |       |       |          |
|  10 |      BUFFER SORT                   |               |      1 |    69 |    10  (10)| 00:00:01 |  2048 |  2048 | 2048  (0)|
|  11 |       TABLE ACCESS BY INDEX ROWID  | TTRNTAU       |      1 |    69 |     4   (0)| 00:00:01 |       |       |          |
|* 12 |        INDEX RANGE SCAN            | ITRN2TAU      |      1 |       |     3   (0)| 00:00:01 |       |       |          |
|* 13 |     TABLE ACCESS BY INDEX ROWID    | TTRNTAU       |      1 |    22 |     5   (0)| 00:00:01 |       |       |          |
|* 14 |      INDEX RANGE SCAN              | ITRN2TAU      |      1 |       |     4   (0)| 00:00:01 |       |       |          |
|  15 |      SORT AGGREGATE                |               |      1 |    28 |            |          |       |       |          |
|* 16 |       CONNECT BY WITH FILTERING    |               |        |       |            |          |  1024 |  1024 |          |
|* 17 |        TABLE ACCESS BY INDEX ROWID | TTRNTAU       |      1 |    50 |     5   (0)| 00:00:01 |       |       |          |
|* 18 |         INDEX RANGE SCAN           | ITRN2TAU      |      1 |       |     4   (0)| 00:00:01 |       |       |          |
|  19 |        NESTED LOOPS                |               |        |       |            |          |       |       |          |
|  20 |         CONNECT BY PUMP            |               |        |       |            |          |       |       |          |
|* 21 |         TABLE ACCESS BY INDEX ROWID| TTRNTAU       |      1 |    28 |     5   (0)| 00:00:01 |       |       |          |
|* 22 |          INDEX RANGE SCAN          | ITRN2TAU      |      1 |       |     4   (0)| 00:00:01 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
1 - filter("DATEFT"=MAX("T2"."DATEFT"))
3 - filter(((("T1"."NATTRN"='CRE' AND NVL("T1"."TYPSOUTRN",'XXX')<>'ANNPAR' AND
NVL("T1"."TYPSOUTRN",'XXX')<>'ANNTOT') OR ("T1"."NATTRN"='ANN' AND (NVL("T1"."TYPSOUTRN",'XXX')='ANNTOT' OR
NVL("T1"."TYPSOUTRN",'XXX')='ANNPAR' OR  IS NULL))) AND =0))
6 - filter("T2"."CATTRN"<>'ANNTEC')
7 - access("T2"."CODAPPOPE"=:B2 AND "T2"."NUMTRAOPE"=:B1)
8 - filter((("T2"."MNECPATRN"="HDP"."MNECPAREF" AND "HDP"."TYPCPAREF"<>'PF') OR ("HDP"."TYPCPAREF"='PF' AND
"T2"."MNECPATRN"=:B3)))
9 - access("T2"."NUMHDRPDT"="HDP"."NUMHDRPDT")
12 - access("T1"."CODAPPOPE"=:B2 AND "T1"."NUMTRAOPE"=:B1)
13 - filter("T3"."NATTRN"='CRE')
14 - access("T3"."CODAPPOPE"=:B1 AND "T3"."NUMTRAOPE"=:B2 AND "T3"."DATEFT"=:B3) filter("T3"."DATEFT"=:B1)
16 - access("T3"."NUMTRNANU"=PRIOR NULL)
17 - filter(("T3"."NUMTRNANU"=:B1 AND "T3"."CATTRN"='ANNTEC'))
18 - access("T3"."CODAPPOPE"=:B1 AND "T3"."NUMTRAOPE"=:B2)
21 - filter(("T3"."NUMTRNANU"=PRIOR NULL AND "T3"."CATTRN"='ANNTEC'))
22 - access("T3"."CODAPPOPE"=:B1 AND "T3"."NUMTRAOPE"=:B2)
 
Note
-----
- Warning: basic plan statistics not available. These are only collected when:
* hint 'gather_plan_statistics' is used for the statement or
* parameter 'statistics_level' is set to 'ALL', at session or system level
 
72 rows selected.
 
SQL>

The operation connect by with filtering (lines 16 to 22 of the plan) is used to process hierarchical queries. It is characterized by two child operations. The first one is used to get the root of the hierarchy, and the second one is executed once for each level in the hierarchy. Each of those two child operations includes an index range scan of itrn2tau.

Looking at lines 4 to 12 of the plan, it is assuming that the result of the nested loops (lines 5 to 9) between ttrntau (accessing index itrn2tau) and thdrpdttau will return one row. It is also assuming that accessing table ttrntau by index itrn2tau will return 1 row as well (line 11-12). Then it just puts those two rows together with a cartesian product and joins that result to table ttrntau accessed by index itrn2tau (line 13-14) with a filter operation (line 3). It is because of the low cardinality expected that it went using a merge join cartesian. It is a perfectly valid approach and is very fast here.

The filter operation used for last join is very similar to a nested loop join - for each row in the driving table, Oracle operates the filter condition to determine whether or not to keep a row. However, filter can be much more efficient than nested loop, because it can remember results of previous probes into the second table - effectively making the probe an in-memory lookup.

As seen previously, for a single query execution, we access index itrn2tau several times. Furthermore, this query was executed concurrently by many sessions, that is why the latches were hammered so much. This is all about scalability, the query is running fine on a single user system but does not scale when run concurrently by hundreds of users.

Fortunately, this query was in-house developed so I asked the dev team to work on it.


猜你喜欢

转载自czmmiao.iteye.com/blog/1544682