Reprinted from: http://www.oracleonlinux.cn/2012/06/row-cache-lock-performance-tuning/
This article is well written and explains in great detail a performance failure caused by the cache.
I received an urgent email: starting at around 9:30 this morning, a customer's production system had become very slow across the board. Querying a ticket list could take half a minute, and the login page sometimes hung for a long time before responding.
This is a two-node RAC database on 10.2.0.5.0. The system normally runs fine, but it had suddenly slowed down. As an optimization rookie, I started with the AWR report. The analysis steps and the solution are given below:
1 The report header shows that during the hour in which the performance problem occurred, DB Time reached 240 minutes, so (DB Time)/Elapsed = 3.93, which suggests a problem in the database.
 | Snap Id | Snap Time | Sessions | Cursors/Session |
---|---|---|---|---|
Begin Snap: | 10638 | 05-June-12 09:00:00 | 54 | 5.8 |
End Snap: | 10639 | 05-June-12 10:01:03 | 102 | 5.8 |
Elapsed: | | 61.04 (mins) | | |
DB Time: |
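The ratio quoted in step 1 can be reproduced from two figures in the report: the elapsed time in the header and the DB Time of 14406.5 s listed in the Time Model Statistics section. The sketch below is only a check of that arithmetic; the variable names are illustrative. DB Time divided by elapsed time is commonly read as the average number of active sessions over the interval.

```python
# Figures copied from the AWR report in this article.
db_time_seconds = 14406.5   # "DB Time" from the Time Model Statistics section
elapsed_minutes = 61.04     # "Elapsed" from the report header

db_time_minutes = db_time_seconds / 60

# DB Time / Elapsed: roughly how many sessions were active (working or
# waiting) in the database at any moment during the snapshot interval.
average_active_sessions = db_time_minutes / elapsed_minutes

print(f"DB Time: {db_time_minutes:.2f} min")
print(f"DB Time / Elapsed: {average_active_sessions:.2f}")
```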
2 The Load Profile section of the report shows that user calls reach about 680 per second, which also points to a problem.
Load Profile
 | Per Second | Per Transaction |
---|---|---|
Redo size: | 12,939.46 | 6,115.60 |
Logical reads: | 67,323.18 | 31,819.06 |
Block changes: | 53.22 | 25.15 |
Physical reads: | 1.02 | 0.48 |
Physical writes: | 4.72 | 2.23 |
User calls: | 679.70 | 321.25 |
Parses: | 90.49 | 42.77 |
Hard parses: | 0.35 | 0.16 |
Sorts: | 1.94 | 0.92 |
Logons: | 0.08 | 0.04 |
Executes: | 316.75 | 149.71 |
Transactions: | 2.12 |
3 Continuing through the report, the Top 5 Timed Events show that the top wait event is row cache lock, with an average wait of 2,128 ms.
The row cache lock wait event is related to the shared pool and is caused by access to the dictionary cache (row cache). Increasing the shared pool is the usual direct fix, but it is not effective in every scenario.
Top 5 Timed Events
Event | Waits | Time(s) | Avg Wait(ms) | % Total Call Time | Wait Class |
---|---|---|---|---|---|
row cache lock | 2,736 | 5,822 | 2,128 | 40.4 | Concurrency |
CPU time | | 4,305 | | 29.9 | |
gc cr block busy | 2,293 | 2,633 | 1,148 | 18.3 | Cluster |
gc buffer busy | 1,569 | 1,096 | 698 | 7.6 | Cluster |
enq: TX - row lock contention | 2,029 | 998 | 492 | 6.9 | Application |
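The Avg Wait and % Total Call Time columns follow directly from the Waits and Time(s) columns together with the DB Time of 14406.5 s. A small sketch of that calculation, using the rows of the table above (the function name is illustrative, and the results may differ from the report by a rounding step because the report's Time(s) values are already rounded):

```python
# (waits, total wait time in seconds) copied from the Top 5 Timed Events table.
top_events = {
    "row cache lock": (2736, 5822),
    "gc cr block busy": (2293, 2633),
    "gc buffer busy": (1569, 1096),
    "enq: TX - row lock contention": (2029, 998),
}
db_time_seconds = 14406.5  # DB Time from the Time Model Statistics section


def average_wait_ms(waits, time_seconds):
    """Average duration of a single wait, in milliseconds."""
    return time_seconds * 1000 / waits


for event, (waits, time_seconds) in top_events.items():
    avg_ms = average_wait_ms(waits, time_seconds)
    share = 100 * time_seconds / db_time_seconds
    print(f"{event}: {avg_ms:.0f} ms per wait, {share:.1f}% of DB Time")
```

An average of more than two seconds per row cache lock wait, on only 2,736 waits, is what makes this event stand out.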
4 Continuing the analysis, sequence load elapsed time ranks second in the Time Model Statistics.
Time Model Statistics
- Total time in database user-calls (DB Time): 14406.5s
- Statistics including the word "background" measure background process time, and so do not contribute to the DB time statistic
- Ordered by % or DB time desc, Statistic name
Statistic Name | Time (s) | % of DB Time |
---|---|---|
sql execute elapsed time | 14,188.01 | 98.48 |
sequence load elapsed time | 6,900.83 | 47.90 |
DB CPU | 4,304.59 | 29.88 |
PL/SQL execution elapsed time | 20.64 | 0.14 |
parse time elapsed | 10.25 | 0.07 |
hard parse elapsed time | 6.00 | 0.04 |
PL/SQL compilation elapsed time | 1.17 | 0.01 |
hard parse (sharing criteria) elapsed time | 1.07 | 0.01 |
repeated bind elapsed time | 0.80 | 0.01 |
hard parse (bind mismatch) elapsed time | 0.64 | 0.00 |
connection management call elapsed time | 0.31 | 0.00 |
failed parse elapsed time | 0.00 | 0.00 |
DB time | 14,406.50 | |
background elapsed time | 2,115.75 | |
background cpu time | 20.52 |
5 The most time-consuming statement turns out to be about the simplest SQL imaginable, SELECT SEQ_NEWID.NEXTVAL FROM DUAL. How can fetching a sequence value take this long?
SQL ordered by Elapsed Time
- Resources reported for PL/SQL code includes the resources used by all SQL statements called by the code.
- % Total DB Time is the Elapsed Time of the SQL statement divided into the Total Database Time multiplied by 100
- Total DB Time (s): 14,406
- Captured SQL account for 223.8% of Total
Elapsed Time (s) | CPU Time (s) | Executions | Elap per Exec (s) | % Total DB Time | SQL Id | SQL Module | SQL Text |
---|---|---|---|---|---|---|---|
6,910 | 0 | 281 | 24.59 | 47.96 | 1gd7ancd2px8m | FC.EdiService.Import.exe | SELECT SEQ_NEWID.NEXTVAL FROM ... |
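6 Next, the dictionary cache statistics: dc_sequences received only 287 get requests, yet 43.9% of them were misses. So the problem really is in fetching the sequence value, which explains why the SQL statement identified in the previous step is so slow: in almost half of the cases the sequence entry was not found in the dictionary cache.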
Dictionary Cache Stats
- "Pct Misses" should be very low (< 2% in most cases)
- "Final Usage" is the number of cache entries being used
Cache | Get Requests | Pct Miss | Scan Reqs | Pct Miss | Mod Reqs | Final Usage |
---|---|---|---|---|---|---|
dc_awr_control | 65 | 1.54 | 0 | | 1 | 1 |
dc_database_links | 304 | 0.00 | 0 | | 0 | 1 |
dc_global_oids | 155 | 0.00 | 0 | | 0 | 24 |
dc_histogram_data | 74,704 | 0.25 | 0 | | 0 | 5,612 |
dc_histogram_defs | 71,400 | 0.26 | 0 | | 0 | 4,945 |
dc_object_ids | 29,398 | 0.01 | 0 | | 0 | 1,136 |
dc_objects | 3,912 | 0.23 | 0 | | 0 | 860 |
dc_profiles | 150 | 0.00 | 0 | | 0 | 1 |
dc_rollback_segments | 17,789 | 0.10 | 0 | | 0 | 56 |
dc_segments | 8,927 | 0.06 | 0 | | 4 | 896 |
dc_sequences | 287 | 43.90 | 0 | | 279 | 3 |
dc_tablespace_quotas | 2 | 50.00 | 0 | | 0 | 2 |
dc_tablespaces | 8,954 | 0.00 | 0 | | 0 | 25 |
dc_usernames | 1,082 | 0.00 | 0 | | 0 | 8 |
dc_users | 13,991 | 0.00 | 0 | | 0 | 31 |
outstanding_alerts | 326 | 77.91 | 0 | | 23 | 54 |
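To make the 43.9% figure concrete, the miss percentage can be turned back into an approximate number of missed gets, and the "Pct Misses should be very low (< 2% in most cases)" note from the report can be applied as a simple filter. This is a sketch over a handful of rows from the table above; the function name and the way the threshold is applied are illustrative, not part of AWR.

```python
# cache name -> (get requests, pct miss), copied from the Dictionary Cache Stats table.
dictionary_cache = {
    "dc_histogram_data": (74704, 0.25),
    "dc_objects": (3912, 0.23),
    "dc_segments": (8927, 0.06),
    "dc_sequences": (287, 43.90),
    "dc_tablespace_quotas": (2, 50.00),
    "outstanding_alerts": (326, 77.91),
}

PCT_MISS_THRESHOLD = 2.0  # the "< 2% in most cases" guideline quoted in the report


def missed_gets(get_requests, pct_miss):
    """Approximate number of get requests that missed the cache."""
    return round(get_requests * pct_miss / 100)


# Caches whose miss percentage is at or above the guideline.
flagged = [
    name
    for name, (gets, pct_miss) in dictionary_cache.items()
    if pct_miss >= PCT_MISS_THRESHOLD
]

for name in flagged:
    gets, pct_miss = dictionary_cache[name]
    print(f"{name}: about {missed_gets(gets, pct_miss)} of {gets} gets missed")
```

For dc_sequences this works out to roughly 126 misses out of 287 gets, which lines up with the 281 executions of the NEXTVAL statement seen in the previous step.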
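7 At this point the basic approach to the fix is clear: caching the sequence in memory should largely solve the problem. I then checked how this sequence had been defined on the production system.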
8 Adjust the sequence so that its values are cached in memory: alter sequence SEQ_NEWID cache 3000;
/*
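Introduction to the sequence cache
The main purpose of CACHE is this:
when the first sequence value is requested, a batch of sequence values is generated in advance and kept in the cache.
For example: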
CREATE SEQUENCE
SEQ_UNIQUEKEY AS INT
START WITH 1
INCREMENT BY 1
MINVALUE 1
MAXVALUE 99999999
CYCLE
CACHE 300
ORDER;
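Once the sequence has been created, values 1 to 300 already exist in the cache, and every NEXTVAL call is served from the cache. When the sequence reaches 300, the next NEXTVAL call generates values 301 to 600 and places them in the cache.
If the current connection is dropped and the application reconnects to the database, the value returned by NEXTVAL is the smallest value in the cache.
For example:
NEXTVAL....... (executed 100 times, so the next value would be 101)
Run commit and disconnect to close the connection (or an exception occurs and a commit is forced).
Then, after reconnecting to the database, NEXTVAL returns 301; values 101 to 300 are discarded automatically.
However, because this database is a RAC environment, switching to a cached sequence causes problems if the application requires the sequence values to be in order.
For example, with cache 40,
instance 1 uses 1-40
instance 2 uses 41-80
After asking the developers about the application's ordering requirements, we confirmed that the application does not depend much on the order of the sequence values, so the sequence cache can be adjusted.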
*/
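The behaviour described in the note above can be illustrated with a toy model. This is not how the database implements sequences; it is a minimal sketch, assuming that each instance reserves a range of `cache_size` values whenever its cache is empty and that discarded values are never reused. The class `CachedSequence`, its methods, and the instance names `inst1` and `inst2` are all hypothetical. A cache size of 1 stands in for an uncached sequence.

```python
from collections import deque


class CachedSequence:
    """Toy model of a sequence that hands out values from per-instance caches."""

    def __init__(self, cache_size):
        self.cache_size = cache_size
        self.next_uncached = 1   # first value not yet placed in any cache
        self.caches = {}         # instance name -> deque of values still cached
        self.refills = 0         # number of times a new range was reserved

    def nextval(self, instance):
        cache = self.caches.setdefault(instance, deque())
        if not cache:
            # Cache is empty: reserve the next range of values for this instance.
            start = self.next_uncached
            cache.extend(range(start, start + self.cache_size))
            self.next_uncached = start + self.cache_size
            self.refills += 1
        return cache.popleft()

    def discard_cache(self, instance):
        """Throw away unused cached values; they are never handed out again."""
        self.caches[instance] = deque()


def refills_for(calls, cache_size):
    """Count how many ranges must be reserved to serve `calls` NEXTVAL calls."""
    sequence = CachedSequence(cache_size)
    for _ in range(calls):
        sequence.nextval("inst1")
    return sequence.refills


# Example from the note: CACHE 300, 100 values used, then the cache is lost.
seq = CachedSequence(cache_size=300)
used = [seq.nextval("inst1") for _ in range(100)]
seq.discard_cache("inst1")
after_loss = seq.nextval("inst1")   # values 101 to 300 are skipped
print(used[-1], after_loss)         # prints "100 301"

# Two instances with CACHE 40: values stay unique but are not issued in order.
rac = CachedSequence(cache_size=40)
issued = [rac.nextval("inst1"), rac.nextval("inst2"), rac.nextval("inst1")]
print(issued)                       # prints "[1, 41, 2]"

# Ranges reserved for 281 calls with a cache of 1 versus a cache of 3000.
print(refills_for(281, 1), refills_for(281, 3000))   # prints "281 1"
```

In this model the expensive step is the range reservation, so a larger cache reduces how often it happens, at the cost of gaps after a cache loss and out-of-order values across instances.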