Long waits on latch free: notes on diagnosing a system anomaly

Reprinted: http://czmmiao.iteye.com/blog/1767091

Today I found that a SQL statement in a reporting database was running abnormally. This is a brief record of how the problem was diagnosed and solved.
The problem turned up while checking the alert log: a procedure had run for too long and failed with an ORA-1555 error.
Error message:
ORA-01555 caused by SQL statement below (Query Duration=38751 sec, SCN: 0x0000.fe5b584a):
INSERT INTO MAN_ORDER_ITEM (
ID,
REQUEST_QTY,
SALER_ID,
PRODUCT_ID,
UNIT_PRICE,
CREATE_DATE,
ANSWER_DATE,
BUYER_ID
)
SELECT
A.RECORD_ID,
A.REQUEST_QTY,
A.SALER_ORGID,
A.PRODUCT_ID,
A.UNIT_PRICE,
A.CREATE_DATE,
NULL,
A.BUYER_ORGID
FROM ORD_ORDER_ITEM A
WHERE A.CREATE_DATE >= TO_DATE('2004-01-01 0:0:0', 'YYYY-MM-DD HH24:MI:SS')
AND A.CREATE_DATE < TRUNC(SYSDATE)
AND EXISTS (SELECT 1 FROM MAN_PRODUCT WHERE ID = A.PRODUCT_ID)
AND EXISTS (SELECT 1 FROM MAN_DEALER WHERE ID = A.SALER_ORGID)
AND EXISTS (SELECT 1 FROM MAN_BUYER WHERE ID = A.BUYER_ORGID)
Because the statement is run from a job, the job retries automatically after a failure. DBA_JOBS_RUNNING shows the running job and its session:
SQL> SELECT SID, JOB FROM DBA_JOBS_RUNNING;
SID JOB
---------- ----------
70 208
Check which SQL this session is currently executing:
SQL> SELECT SQL_TEXT FROM V$SQL SQL, V$SESSION S
2 WHERE SQL.HASH_VALUE = S.SQL_HASH_VALUE
3 AND SQL.ADDRESS = S.SQL_ADDRESS
4 AND S.SID = 70;
SQL_TEXT
------------------------------------------------------------------------------
INSERT INTO MAN_ORDER_ITEM ( ID, REQUEST_QTY, SALER_ID, PRODUCT_ID, UNIT_PRICE, CREATE_DATE, ANSWER_DATE,
BUYER_ID ) SELECT A.RECORD_ID, A.REQUEST_QTY, A.SALER_ORGID, A.PRODUCT_ID, A.UNIT_PRICE, A.CREATE_DATE,
NULL, A.BUYER_ORGID FROM ORD_ORDER_ITEM A WHERE A.CREATE_DATE >= TO_DATE('2004-01-01 0:0:0', 'YYYY-MM-DD HH24:MI:S
S') AND A.CREATE_DATE < TRUNC(SYSDATE) AND EXISTS (SELECT 1 FROM MAN_PRODUCT WHERE ID = A.PRODUCT_ID) AND EXISTS (SEL
ECT 1 FROM MAN_DEALER WHERE ID = A.SALER_ORGID) AND EXISTS (SELECT 1 FROM MAN_BUYER WHERE ID = A.BUYER_ORGID)
Judging from the SQL text, this is the same statement that failed earlier. Next, see what the session is waiting for:
SQL> SELECT SID, EVENT, P1TEXT, P1RAW, P2TEXT, P2, SECONDS_IN_WAIT FROM V$SESSION_WAIT
2 WHERE SID = 70;
 SID EVENT      P1TEXT  P1RAW            P2TEXT  P2 SECONDS_IN_WAIT
---- ---------- ------- ---------------- ------ --- ---------------
  70 latch free address 00000004125AB718 number  98             330
Repeated checks showed that the session was always waiting on latch free. My first impression was that it might be contending with other processes.
For a latch free wait, P2 is the latch number, so query which latch Oracle is waiting for:
SQL> SELECT LATCH#, NAME FROM V$LATCH WHERE LATCH# = 98;
    LATCH# NAME
---------- ----------------------------------------------------------------
        98 cache buffers chains
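The two lookups above can also be combined into a single query. This is a minimal sketch, assuming the session is still waiting on latch free (so that P2 holds the latch number, as in the output above); the column aliases are illustrative and not from the original session:

```sql
-- Resolve the latch name for a session waiting on "latch free".
-- For this wait event, P1RAW is the latch address and P2 is the latch number.
SELECT w.sid,
       w.event,
       w.p1raw AS latch_addr,
       l.name  AS latch_name,
       w.seconds_in_wait
  FROM v$session_wait w,
       v$latch l
 WHERE l.latch# = w.p2
   AND w.event = 'latch free'
   AND w.sid = 70;
```

If the session happens to be between waits when the query runs, it returns no rows, so it may need to be run several times.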
Querying the V$LOCK and V$LATCHHOLDER views, however, showed no other process interfering with the job:
SQL> SELECT SID, TYPE, ID1, ID2, LMODE, REQUEST, CTIME, BLOCK
2 FROM V$LOCK
3 WHERE SID > 8;
SID TY ID1 ID2 LMODE REQUEST CTIME BLOCK
---------- -- ---------- ---------- ---------- ---------- ---------- ----------
70 TM 35258 0 3 0 12072 0
70 JQ 0 208 6 0 12155 0
SQL> SELECT * FROM V$LATCHHOLDER;
no rows selected
SQL> SELECT * FROM V$LATCHHOLDER;
no rows selected
SQL> SELECT * FROM V$LATCHHOLDER;
no rows selected
SQL> SELECT * FROM V$LATCHHOLDER;
PID SID LADDR NAME
---------- ---------- ---------------- ------------------------------
15 70 0000000412564D98 cache buffers chains
SQL> SELECT * FROM V$LATCHHOLDER;
no rows selected
As you can see, no other session is affecting the job process.
Since the wait event is latch free, I suspected that the problem lay in the system itself.
The following query shows the child latch that the session is currently waiting on:
SQL> SELECT ADDR, LATCH#, CHILD#, NAME FROM V$LATCH_CHILDREN
2 WHERE ADDR IN (SELECT P1RAW FROM V$SESSION_WAIT WHERE SID = 70) ;
ADDR                 LATCH#     CHILD# NAME
---------------- ---------- ---------- ----------------------------------------
0000000412550518         98        327 cache buffers chains
Look at the information in V$LATCH_MISSES:
SQL> COL PARENT_NAME FORMAT A20
SQL> COL WHERE FORMAT A35
SQL> SELECT *
2 FROM
3 (
4 SELECT PARENT_NAME, "WHERE", SLEEP_COUNT, WTR_SLP_COUNT, LONGHOLD_COUNT
5 FROM V$LATCH_MISSES
6 WHERE PARENT_NAME = 'cache buffers chains'
7 ORDER BY SLEEP_COUNT + WTR_SLP_COUNT + LONGHOLD_COUNT DESC
8 )
9 WHERE ROWNUM < 20;
PARENT_NAME          WHERE                               SLEEP_COUNT WTR_SLP_COUNT LONGHOLD_COUNT
-------------------- ----------------------------------- ----------- ------------- --------------
cache buffers chains kcbgtcr: kslbegin excl                  1202658        884364         906374
cache buffers chains kcbrls: kslbegin                         480030        852471         335799
cache buffers chains kcbzwb                                    95994         90482          84373
cache buffers chains kcbgtcr: kslbegin shared                  89385         84911          62640
cache buffers chains kcbgtcr: fast path                        69352        113120          51014
cache buffers chains kcbchg: kslbegin: bufs not pinned         86476         51934          58687
cache buffers chains kcbzsc                                    76224            55          75045
cache buffers chains kcbbxsv                                   37425          8306          35681
cache buffers chains kcbchg: kslbegin: call CR func             1337         20943            745
cache buffers chains kcbzib: finish free chains bufs             685         18544            432
cache buffers chains kcbget: pin buffer                          542          4508            383
cache buffers chains kcbgtcr                                    2400           395           1769
cache buffers chains kcbbic1                                      14          4015             11
cache buffers chains kcbzgb: scan from tail. nowait             2048             0           1896
cache buffers chains kcbbic2                                      38          2920             32
cache buffers chains kcbzib: multi-block : nowait               1502             0            970
cache buffers chains kcbnew                                      497           331            289
19 rows selected.
This looks like a hot block problem, so let's see which blocks are involved:
SQL> SELECT OBJ, OBJECT_NAME, TCH, TIM
2 FROM X$BH A, DBA_OBJECTS B
3 WHERE HLADDR IN (SELECT P1RAW FROM V$SESSION_WAIT WHERE SID = 70)
4 AND A.OBJ = B.DATA_OBJECT_ID;
OBJ OBJECT_NAME TCH TIM
---------- ------------------------------ ---------- ----------
109 I_OBJAUTH2 0 0
45761 STATS$SQLTEXT_PK 1 1174381376
.
.
62275 ORD_ORDER_ITEM_ZJ 1 1174380112
62275 ORD_ORDER_ITEM_ZJ 1 1174380085
.
.
200403 ORD_ORDER_ITEM 6 1174381870
200403 ORD_ORDER_ITEM 6 1174381870
200403 ORD_ORDER_ITEM 6 1174381870
.
.
62275 ORD_ORDER_ITEM_ZJ 678 1174381878
62275 ORD_ORDER_ITEM_ZJ 1 1174380060
62275 ORD_ORDER_ITEM_ZJ 1 1174380091
62126 ORD_ORDER_ITEM_CEN 0 0
62126 ORD_ORDER_ITEM_CEN 0 0
62126 ORD_ORDER_ITEM_CEN 0 0
62126 ORD_ORDER_ITEM_CEN 0 0
45772 STATS$UNDOSTAT 0 0
45772 STATS$UNDOSTAT 0 0
125 rows selected.
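With 125 rows, the raw listing is hard to read at a glance. A possible variation, not part of the original session, is to aggregate the same buffers by object so that the hottest objects stand out. It is only a sketch: it assumes access to X$BH (normally as SYS), and the aliases BUFFERS, TOTAL_TOUCHES and MAX_TOUCHES are made up for illustration:

```sql
-- Summarize the buffers protected by the latch the session is waiting on,
-- grouped by the object they belong to. TCH is the buffer touch count.
SELECT b.object_name,
       COUNT(*)   AS buffers,
       SUM(a.tch) AS total_touches,
       MAX(a.tch) AS max_touches
  FROM x$bh a,
       dba_objects b
 WHERE a.hladdr IN (SELECT p1raw FROM v$session_wait WHERE sid = 70)
   AND a.obj = b.data_object_id
 GROUP BY b.object_name
 ORDER BY max_touches DESC;
```

Like the original query, this only reflects the child latch the session is waiting on at the moment it runs.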
Judging from the objects these hot blocks belong to, most of them are accessed by the long-running SQL itself. Had I gone around in a big circle only to find that the problem was the execution plan?
Check the execution plan of this SQL:
SQL> EXPLAIN PLAN FOR
2 INSERT INTO MAN_ORDER_ITEM (
3 ID,
4 REQUEST_QTY,
5 SALER_ID,
6 PRODUCT_ID,
7 UNIT_PRICE,
8 CREATE_DATE,
9 ANSWER_DATE,
10 BUYER_ID
11 )
12 SELECT
13 A.RECORD_ID,
14 A.REQUEST_QTY,
15 A.SALER_ORGID,
16 A.PRODUCT_ID,
17 A.UNIT_PRICE,
18 A.CREATE_DATE,
19 NULL,
20 A.BUYER_ORGID
21 FROM ORD_ORDER_ITEM A
22 WHERE A.CREATE_DATE >= TO_DATE('2004-01-01 0:0:0', 'YYYY-MM-DD HH24:MI:SS')
23 AND A.CREATE_DATE < TRUNC(SYSDATE)
24 AND EXISTS (SELECT 1 FROM MAN_PRODUCT WHERE ID = A.PRODUCT_ID)
25 AND EXISTS (SELECT 1 FROM MAN_DEALER WHERE ID = A.SALER_ORGID)
26 AND EXISTS (SELECT 1 FROM MAN_BUYER WHERE ID = A.BUYER_ORGID)
27 ;

Explained.

SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY());

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
-------------------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 1 | 212 | 37 |
|* 1 | FILTER | | | | |
| 2 | NESTED LOOPS | | 1 | 212 | 37 |
| 3 | MERGE JOIN CARTESIAN | | 1 | 76 | 12 |
| 4 | MERGE JOIN CARTESIAN | | 1 | 51 | 9 |
| 5 | SORT UNIQUE | | | | |
| 6 | INDEX FAST FULL SCAN | PK_MAN_PRODUCT | 1 | 26 | 3 |
| 7 | BUFFER SORT | | 8138 | 198K| 6 |
| 8 | SORT UNIQUE | | | | |
| 9 | INDEX FAST FULL SCAN | PK_MAN_BUYER | 8138 | 198K| 3 |
| 10 | BUFFER SORT | | 14238 | 347K| 9 |
| 11 | SORT UNIQUE | | | | |
| 12 | INDEX FAST FULL SCAN | PK_MAN_DEALER | 14238 | 347K| 3 |
|* 13 | VIEW | ORD_ORDER_ITEM | 1 | 136 | 37 |
| 14 | UNION-ALL PARTITION | | | | |
|* 15 | FILTER | | | | |
|* 16 | TABLE ACCESS BY INDEX ROWID| ORD_ORDER_ITEM_CEN | 1 | 116 | 14 |
| 17 | AND-EQUAL | | | | |
|* 18 | INDEX RANGE SCAN | TU_ORD_ORD_ITEM_PRODUCT_ID | | | |
|* 19 | INDEX RANGE SCAN | TU_ORD_ORDER_ITEM_SALER | | | |
|* 20 | FILTER | | | | |
|* 21 | TABLE ACCESS BY INDEX ROWID| ORD_ORDER_ITEM_ZJ | 1 | 116 | 2 |
|* 22 | INDEX RANGE SCAN | TU_ORD_ORD_ITEM_PRODUCT_ID1 | 179 | | 1 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(TRUNC(SYSDATE@!)>TO_DATE('2004-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss') AND
TO_DATE('2004-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss')<TRUNC(SYSDATE@!))
13 - filter("MAN_PRODUCT"."ID"="A"."PRODUCT_ID" AND "MAN_DEALER"."ID"="A"."SALER_ORGID" AND
"MAN_BUYER"."ID"="A"."BUYER_ORGID")
15 - filter(TRUNC(SYSDATE@!)>TO_DATE('2004-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss') AND
TO_DATE('2004-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss')<TRUNC(SYSDATE@!))
16 - filter("ORD_ORDER_ITEM_CEN"."BUYER_ORGID"="MAN_BUYER"."ID" AND
"ORD_ORDER_ITEM_CEN"."SALER_ORGID"="MAN_DEALER"."ID" AND
"ORD_ORDER_ITEM_CEN"."PRODUCT_ID"="MAN_PRODUCT"."ID" AND
"ORD_ORDER_ITEM_CEN"."CREATE_DATE">=TO_DATE('2004-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss') AND
"ORD_ORDER_ITEM_CEN"."CREATE_DATE"<TRUNC(SYSDATE@!))
18 - access("ORD_ORDER_ITEM_CEN"."PRODUCT_ID"="MAN_PRODUCT"."ID")
19 - access("ORD_ORDER_ITEM_CEN"."SALER_ORGID"="MAN_DEALER"."ID")
20 - filter(TRUNC(SYSDATE@!)>TO_DATE('2004-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss') AND
TO_DATE('2004-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss')<TRUNC(SYSDATE@!))
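
The plan is built around MERGE JOIN CARTESIAN operations, and the optimizer expects only 1 row from PK_MAN_PRODUCT. Check the statistics of the three tables:
SQL> SELECT TABLE_NAME, NUM_ROWS FROM USER_TABLES
2 WHERE TABLE_NAME IN ('MAN_PRODUCT', 'MAN_BUYER', 'MAN_DEALER');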
TABLE_NAME NUM_ROWS
------------------------------ ----------
MAN_BUYER 8138
MAN_DEALER 14238
MAN_PRODUCT 0
SQL> SELECT COUNT(*) FROM MAN_PRODUCT;
COUNT(*)
----------
91750
The cause is now obvious: the statistics for MAN_PRODUCT are wrong. Because Oracle believes MAN_PRODUCT has 0 rows, it chooses MERGE JOIN CARTESIAN, which would reach the final result (0 rows) as quickly as possible. In reality MAN_PRODUCT holds not 0 rows but nearly 100,000. This is the real cause of the problem.
The story does not end there, though: why did Oracle have incorrect statistics? The script first clears the data in the related tables and then regenerates it. An earlier run of this procedure failed halfway, leaving MAN_PRODUCT empty, and the weekly statistics-gathering job then recorded 0 rows for the table.
When the script ran again, rows were written to MAN_PRODUCT but the statistics were not refreshed, which led to this problem.
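A mismatch like this can be looked for directly. The block below is a rough sketch rather than anything from the original diagnosis: it assumes the tables are in the current schema, and it skips index-organized tables and their overflow segments for simplicity. It lists tables whose statistics say 0 rows but which actually contain at least one row:

```sql
SET SERVEROUTPUT ON
DECLARE
  v_has_rows NUMBER;
BEGIN
  -- Candidate tables: statistics claim the table is empty.
  FOR t IN (SELECT table_name, last_analyzed
              FROM user_tables
             WHERE num_rows = 0
               AND iot_type IS NULL) LOOP
    -- ROWNUM = 1 stops at the first row, so the check stays cheap.
    EXECUTE IMMEDIATE
      'SELECT COUNT(*) FROM "' || t.table_name || '" WHERE ROWNUM = 1'
      INTO v_has_rows;
    IF v_has_rows > 0 THEN
      DBMS_OUTPUT.PUT_LINE(t.table_name
        || ' has rows but NUM_ROWS = 0 (last analyzed '
        || TO_CHAR(t.last_analyzed, 'YYYY-MM-DD HH24:MI:SS') || ')');
    END IF;
  END LOOP;
END;
/
```

Running a check like this after a reload, or gathering statistics as the last step of the reload script, are two ways this situation might be caught earlier.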
Once the cause is understood, the fix is simple: gather statistics on MAN_PRODUCT, check the execution plan, kill the running job, and let the job restart.
SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'MAN_PRODUCT', CASCADE => TRUE)
PL/SQL procedure successfully completed.
SQL> EXPLAIN PLAN FOR
2 INSERT INTO MAN_ORDER_ITEM (
3 ID,
4 REQUEST_QTY,
5 SALER_ID,
6 PRODUCT_ID,
7 UNIT_PRICE,
8 CREATE_DATE,
9 ANSWER_DATE,
10 BUYER_ID
11 )
12 SELECT
13 A.RECORD_ID,
14 A.REQUEST_QTY,
15 A.SALER_ORGID,
16 A.PRODUCT_ID,
17 A.UNIT_PRICE,
18 A.CREATE_DATE,
19 NULL,
20 A.BUYER_ORGID
21 FROM ORD_ORDER_ITEM A
22 WHERE A.CREATE_DATE >= TO_DATE('2004-01-01 0:0:0', 'YYYY-MM-DD HH24:MI:SS')
23 AND A.CREATE_DATE < TRUNC(SYSDATE)
24 AND EXISTS (SELECT 1 FROM MAN_PRODUCT WHERE ID = A.PRODUCT_ID)
25 AND EXISTS (SELECT 1 FROM MAN_DEALER WHERE ID = A.SALER_ORGID)
26 AND EXISTS (SELECT 1 FROM MAN_BUYER WHERE ID = A.BUYER_ORGID)
27 ;
Explained.

SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY());
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost |
--------------------------------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 32977 | 6795K| | 72667 |
|* 1 | FILTER | | | | | |
| 2 | NESTED LOOPS SEMI | | 32977 | 6795K| | 72667 |
| 3 | NESTED LOOPS SEMI | | 32977 | 5989K| | 72667 |
|* 4 | HASH JOIN SEMI | | 32977 | 5184K| 4768K| 72667 |
| 5 | VIEW | ORD_ORDER_ITEM | 32977 | 4379K| | 72502 |
| 6 | UNION-ALL | | | | | |
|* 7 | FILTER | | | | | |
| 8 | TABLE ACCESS BY INDEX ROWID| ORD_ORDER_ITEM_CEN | 5797K| 641M| | 479 |
|* 9 | INDEX RANGE SCAN | TU_ORD_ORDER_ITEM_CREATE_DATE | 5797K| | | 16 |
|* 10 | FILTER | | | | | |
| 11 | TABLE ACCESS BY INDEX ROWID| ORD_ORDER_ITEM_ZJ | 2464K| 272M| | 643 |
|* 12 | INDEX RANGE SCAN | TU_ORD_ORDER_ITEM_CREATE_DATE1 | 2464K| | | 21 |
| 13 | INDEX FAST FULL SCAN | PK_MAN_PRODUCT | 91750 | 2239K| | 39 |
|* 14 | INDEX UNIQUE SCAN | PK_MAN_BUYER | 8138 | 198K| | |
|* 15 | INDEX UNIQUE SCAN | PK_MAN_DEALER | 14238 | 347K| | |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(TRUNC(SYSDATE@!)>TO_DATE('2004-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss') AND
TO_DATE('2004-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss')<TRUNC(SYSDATE@!))
4 - access("MAN_PRODUCT"."ID"="A"."PRODUCT_ID")
7 - filter(TRUNC(SYSDATE@!)>TO_DATE('2004-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss') AND
TO_DATE('2004-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss')<TRUNC(SYSDATE@!))
9 - access("ORD_ORDER_ITEM_CEN"."CREATE_DATE">=TO_DATE('2004-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss') AND
"ORD_ORDER_ITEM_CEN"."CREATE_DATE"<TRUNC(SYSDATE@!))
10 - filter(TRUNC(SYSDATE@!)>TO_DATE('2004-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss') AND
TO_DATE('2004-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss')<TRUNC(SYSDATE@!))
12 - access("ORD_ORDER_ITEM_ZJ"."CREATE_DATE">=TO_DATE('2004-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss') AND
"ORD_ORDER_ITEM_ZJ"."CREATE_DATE"<TRUNC(SYSDATE@!))
14 - access("MAN_BUYER"."ID"="A"."BUYER_ORGID")
15 - access("MAN_DEALER"."ID"="A"."SALER_ORGID")

Note: cpu costing is off

40 rows selected.
SQL> SELECT SPID FROM V$PROCESS WHERE ADDR IN (SELECT PADDR FROM V$SESSION WHERE SID = 70);
SPID
------------
27488
The query above finds the operating system process that is running the job; that process is then killed with the operating system command kill -9. A session running a job is hard to kill with ALTER SYSTEM KILL SESSION, so the operating system command is used instead.
SQL> host
$ ps -ef | grep 27488
oracle 28672 28671 0 18:07:20 pts/2 0:00 grep 27488
oracle 27488 1 13 12:46:56 ? 317:21 ora_j000_repdb01
$ kill -9 27488
$ exit
Check the session and job status to confirm that the job has restarted.
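The original post does not show this check. One way it might be done, assuming the job number 208 seen earlier, is to join DBA_JOBS_RUNNING with DBA_JOBS:

```sql
-- A row here means job 208 is running again; THIS_DATE is when the current
-- run started, FAILURES counts consecutive failed runs, and BROKEN = 'Y'
-- would mean Oracle has stopped retrying the job.
SELECT r.sid,
       r.job,
       r.this_date,
       r.this_sec,
       j.failures,
       j.broken
  FROM dba_jobs_running r,
       dba_jobs j
 WHERE j.job = r.job
   AND r.job = 208;
```

The new session will normally have a different SID from the killed one, so the earlier queries that filter on SID = 70 need to be adjusted accordingly.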
At this point the problem is solved. When the problem was first noticed, there were actually two ways to approach it. One was to start from how the system was behaving, which is the approach taken in this article. The other was to start directly from the SQL statement and check the execution plan first.
Because this SQL had run normally many times before, I did not expect the execution plan to change so drastically, so I kept to the first approach throughout. Fortunately, both routes lead to the same destination: although I took a long detour, I found the problem in the end.

Refer to: http://yangtingkun.itpub.net/post/468/273340
