Oracle High Concurrency Series 1: Common Problems and Optimization Ideas Caused by DML

[url]http://click.aliyun.com/m/21918/ [/url]


Introduction to the author
Wang Pengchong, a database technology expert at Ping An Technology, has been immersed in the database industry for more than ten years. He has architecture and operations experience with MongoDB, Redis, and other databases, is currently absorbed in comparing PostgreSQL with Oracle, and focuses on research into distributed architectures for relational databases.


Introduction 


The Oracle database is designed as a highly shared system. The "sharing" here covers shared memory, background processes, cursors, execution plans, latches, and so on. The goal of this design is to minimize system overhead and support as many concurrent sessions as possible. It is also because of this design philosophy that the vertical scaling capability of a single Oracle instance has long been a leader in the database field.



I once read an article by a PostgreSQL expert analyzing why Oracle's "cursor: pin S" wait does not appear in PostgreSQL. The main reason is that PostgreSQL's execution plans are not globally shared, while in Oracle the same cursor can generally be shared across different sessions (although Oracle will also trigger a new hard parse under certain conditions). Objectively speaking, this design has both advantages and disadvantages. Because PostgreSQL's plan cache is not shared between sessions, it avoids contention for the same cursor during high concurrency; but it also means that for the same number of concurrent sessions, PostgreSQL needs more cache memory and each session must parse each statement at least once. Put the other way around, under the same resource constraints, Oracle supports a higher level of concurrency.



To quote a veteran Oracle DBA holding an Oracle 7 OCP: "In the early days, Oracle used session-private memory, but as concurrent load increased, memory consumption became a problem, and execution plans could not be shared, which increased parse time. For an OLTP system, an increase in parse time has a large impact on overall execution time. Oracle therefore optimized on this basis, introducing the session cursor cache, the shared pool, and so on, which reduced parsing and planning time during SQL execution. But there is no free lunch; there is always some other cost, such as the cost of protecting shared memory structures against concurrent access." In short:



Session-level SQL parsing is the technology Oracle used at the beginning.

Any application must be well designed according to the characteristics of the database it uses.



We will not argue here about which database is more impressive. The development of every database technology is shaped by many factors, including business strategy, market demand, and the maturity of software and hardware. We have used Oracle for many years and have deep feelings for it, yet at the same time we are investing in open source databases and NoSQL without hesitation. There is no absolutely good or bad technology; the one that best fits the application scenario is the best. Here we focus only on the problems that arise when these "shared" resources of the Oracle database meet high concurrency, along with their causes and countermeasures.



This article presents ideas rather than specific commands.


The approaches described here are summarized from real cases we have actually handled.



Problems caused by high concurrency DML



Oracle's tables are heap tables and its indexes are B-trees. When DML is performed on a table, Oracle operates on the table's blocks and must also maintain the blocks of the index trees. When a large number of sessions need to maintain the same block of an index (or table) at the same time, contention arises on that index (or table). When contention occurs, v$session_wait shows the name of the event the current session is waiting for.
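As a starting point for diagnosis, a minimal sketch of querying v$session_wait for current enqueue waits (run as a privileged user; column names follow the Oracle data dictionary):

```sql
-- Which sessions are currently stuck on an enqueue wait, and for how long.
SELECT sid, event, p1, p2, p3, seconds_in_wait
  FROM v$session_wait
 WHERE event LIKE 'enq:%'
 ORDER BY seconds_in_wait DESC;
```

The p1/p2/p3 columns carry event-specific details (for enqueue waits, the lock type and the resource identifiers), which help map a wait back to a specific object.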



  1 enq: TX - allocate ITL entry


This wait event indicates that the current session is waiting for the allocation of a transaction slot (ITL entry) on a block, which may be a table block or an index block.



Possible reasons are:



INITRANS is set too small.

Concurrent DML is too high: the ITL slots are occupied by other sessions that have not yet released them, and there is no free space left in the block to add a new ITL slot, so subsequent sessions can only wait until a slot is freed.

The settings are reasonable and concurrency is not especially high, but the efficiency of the running statements has degraded, so ITL slots are held longer, which in turn blocks subsequent sessions.



Solution:



Rebuild existing indexes and increase INITRANS during the rebuild, for example from 16 to 32 to 64. (If PCTFREE is also increased during the rebuild, each block has more free space immediately after the rebuild completes, but this free space may still be consumed by subsequent index maintenance.)

For new indexes, modify the database development standard so that the default INITRANS is 16 when an index is created.

Table blocks rarely hit this problem, but one of our very frequently updated tables did encounter it in production, so in the end we also changed the standard so that the default INITRANS is 6 when a new table is created.

If it is confirmed that ITL slots are being held longer because SQL statement efficiency has degraded, analyze why the statements slowed down and optimize them.
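A sketch of the rebuild measure above (the index and table names are hypothetical; the INITRANS and PCTFREE values are examples and should be chosen for your workload):

```sql
-- Rebuild with a larger INITRANS; ONLINE keeps the index available for DML.
ALTER INDEX idx_orders_status REBUILD INITRANS 32 PCTFREE 20 ONLINE;

-- New objects created under the revised development standard:
-- CREATE INDEX ... INITRANS 16;
-- CREATE TABLE ... INITRANS 6;
```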



A small tip:



If you see in v$lock that a session is requesting a lock in mode 4 (share), one possible cause is an ITL wait; other possible causes include concurrent operations on primary keys, bitmap indexes, distributed transactions, and so on.
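The check above can be sketched as a simple query on v$lock (standard columns; interpretation of id1/id2 depends on the lock type):

```sql
-- Sessions requesting, but not yet holding, a lock in mode 4 (share).
SELECT sid, type, id1, id2, lmode, request
  FROM v$lock
 WHERE request = 4;
```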



  2 enq: TX - index contention


This problem generally occurs when a table is under high-concurrency inserts, with waits on blocks of indexes built on DATE columns or on values from an ascending sequence. Because the application always inserts the latest (highest-key) values, these indexes tend to grow right-leaning: the most recent and most frequent operations land on the rightmost leaf block of the index, that leaf block's free space fills up quickly, and the block then splits. The split process must find a free block, and the splitting process holds an enq: TX lock; other concurrently inserting sessions generally also need to insert into the rightmost leaf block, so they must wait for the splitter process to complete and release the lock. (When contention is more intense, the wait can even arise during branch-block splits.)



Solution:  



Drop unused indexes. There is a reason this obvious measure comes first: many developers do not realize that too many indexes on a table hurt DML; they only know that indexes help queries. In extreme cases, single-column or composite indexes are created for every column of a table. In practice, sampling by DBAs often shows that many of these indexes have not been used for six months or a year, so why keep them around?

Convert the index into a hash-partitioned index. The principle is that concurrently operated leaf nodes are spread across partitions.

Convert the index into a reverse key index. The principle is the same: because the key bytes are reversed, the high-key leaf nodes are also spread out.

Use a smaller block size, for example 8K -> 4K -> 2K. The principle is similar: a smaller index block holds fewer entries, which in theory reduces the probability that two different sessions access the same block at the same time, thereby reducing contention. However, this approach has other side effects, so unless those can be ruled out it is not recommended.

Rebuild the index. Rebuilding helps because it reduces index fragmentation and makes the index blocks more compact, which shortens the search for empty blocks during a leaf-block split; more efficient splits in turn reduce wait time.

If the index contention is on the root block rather than a leaf block, you can consider activating an optimization for root-block splits as follows:

1) alter system set events '43822 trace name context forever, level 1';

2) After event 43822 is enabled, root-block splitting is enhanced: Oracle will request the allocation of a new block after no more than 5 index block reclamation probes.
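Sketches of the two index transformations above (object names and the partition count are hypothetical):

```sql
-- Hash-partitioned index: spreads the hot right-edge keys across partitions.
CREATE INDEX idx_orders_created ON orders (created_at)
  GLOBAL PARTITION BY HASH (created_at) PARTITIONS 8;

-- Reverse key index: reverses the key bytes so sequential keys scatter
-- across leaf blocks.
CREATE INDEX idx_orders_id_rev ON orders (order_id) REVERSE;
```

Note that a reverse key index makes range scans on the column ineffective, so it suits workloads dominated by equality lookups.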



Background knowledge:



Oracle's process for finding reusable free blocks during an index split is as follows:



Oracle does not immediately ask the index segment for new space (that would make the segment grow excessively); instead it searches for free blocks elsewhere in the index segment. The candidate blocks are those whose status is 75%-100% free. The server process scans these candidates and checks whether they are actually 100% empty. If a 100% free block is found, it is used; if not, the search continues until all candidate blocks have been checked. This behavior is called "probes on index block reclamation". Each time a search for an empty block fails, Oracle increments the statistic "failed probes on index block reclamation". An internal mechanism limits how many probes are made: Oracle will not full-scan all index blocks, and after failing more than a certain number of times it requests the allocation of a new block.



There are two reasons a candidate block cannot be reused:



The block may not be 100% free but only 70% to just under 100% free, that is, the block still holds one or more index entries, so it cannot be reused for the split.

There may be other active transactions on the block, so it cannot be reused.



During this process, the block Oracle finds may actually be a non-empty block within the index structure, but Oracle only discovers that the block was an invalid choice after splitting it and relinking it into the index structure. Oracle then rolls the operation back (recorded in the 'transaction rollback' statistic in v$sysstat) and continues searching for another block.



While Oracle searches for empty blocks, physical reads increase if the blocks are not in memory, and if the blocks need delayed block cleanout or rollback, more recursive operations are triggered. Clearly, when there are too many "failed probes" and splits are inefficient, index contention increases directly.
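The probe statistic mentioned above can be checked directly (a LIKE pattern is used here since the exact statistic names vary slightly across versions):

```sql
-- How often probes for reusable empty blocks have failed during splits.
SELECT name, value
  FROM v$sysstat
 WHERE name LIKE '%probes on index block reclamation%';
```

A steadily climbing failed-probe count alongside 'enq: TX - index contention' waits supports the split-inefficiency diagnosis.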



  3 enq: HW - contention


The High Water Mark (HWM) of a table identifies the boundary between used and unused space in the table segment. Specifically, blocks above the HWM are unformatted and have never been used. Blocks below the HWM are in one of these states: allocated but currently unformatted and unused; formatted and containing data; or formatted but empty because the data was deleted. When the blocks below the HWM have no free space left, Oracle pushes the HWM up and requests the allocation of new blocks to the segment, and the HW enqueue is used to serialize this HWM-push operation.



Obviously this can happen under high-concurrency inserts, and the situation is even worse when the table has LOB columns. When the rate at which the HWM can be pushed up to allocate new space cannot keep up with the rate at which concurrent sessions need space, sessions end up waiting on the HW enqueue.



Solution: 



Drop unused indexes.

Convert the table into a hash-partitioned table, so that concurrent space-allocation demands are spread across multiple partitions.

Manually allocate new space in advance (this can be turned into a regular automated task).

Proactively shrink segments to reclaim reusable space, to avoid contention on automatic allocation during peak business hours.

Use a larger UNIFORM SIZE for the tablespace, so that each HWM push allocates a larger extent to the table, avoiding the occasional additional wait for tablespace extent allocation when HWM pressure is severe.

Make sure to use an ASSM (Automatic Segment Space Management) tablespace.

The hidden parameter _bump_highwater_mark_count controls how many blocks the HWM advances each time. However, setting this hidden parameter should be backed by Oracle Support, and it can negatively affect other small tables.

Check the performance of the I/O subsystem. Sometimes degraded I/O performance slows space-allocation operations, which in turn causes waits.

Frequent recycling of LOB segment space can also lead to this contention. For LOBs, you can increase the chunk size appropriately so more space is allocated each time, or proactively allocate or shrink space.

In addition, for LOBs in ASSM tablespaces there is Bug 6376915. Check whether the fix has been applied; it must then be enabled by setting an event. The event controls how many chunks one LOB chunk-reclamation operation processes (the default is 1), which can reduce the number of HW enqueue waits:

EVENT="44951 TRACE NAME CONTEXT FOREVER, LEVEL <1-1024>"
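Sketches of the manual pre-allocation and shrink measures above (the table name and extent size are hypothetical; SHRINK SPACE requires row movement to be enabled):

```sql
-- Pre-allocate an extent ahead of a known peak, outside business hours.
ALTER TABLE orders ALLOCATE EXTENT (SIZE 100M);

-- Reclaim reusable space during a quiet window.
ALTER TABLE orders ENABLE ROW MOVEMENT;
ALTER TABLE orders SHRINK SPACE;
```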



  4 enq: US - contention


This wait event usually indicates that the session is waiting for an undo segment. Note that the cause is generally not that the UNDO tablespace is out of space: insufficient UNDO tablespace space directly raises ORA-30036 (and increments NOSPACEERRCNT) instead.



Typical scenarios that cause this wait are:



If the UNDO tablespace is AUTOEXTEND, Oracle automatically tunes undo retention. While trying to honor the retention period set by the retention parameter, it also tries to satisfy the read-consistency needs of long-running queries (MAXQUERYLEN). When this feature kicks in, many undo segments are kept to support long queries, and when many concurrent sessions request undo segment allocation at the same time, Oracle's reclamation mechanism (UNXPSTEAL) is stretched thin.

A large number of active undo blocks are being rolled back and cannot be reused, possibly because a long transaction was killed not long before.

It is also possible that, although free space exists, a large number of undo segments need to be brought from offline to online within a short time after high-concurrency transactions flood in (for example after an application restart, or when a flash-sale application starts on the dot), but SMON cannot process them fast enough. There may then be a burst of enq: US - contention for a short time, usually accompanied by a large number of 'latch: row cache objects' waits (on DC_ROLLBACK_SEGMENTS). One of our insurance systems hit this problem in its backend database during the Double 11 flash sale.



Solution:



If you expect a flash sale, you can prepare in advance: set _ROLLBACK_SEGMENT_COUNT to a higher value to keep a certain number of undo segments always online.

Set an event so that SMON does not automatically take undo segments OFFLINE:

alter system set events '10511 trace name context forever, level 1';

Temporarily set _UNDO_AUTOTUNE to FALSE to prevent Oracle from automatically tuning undo retention up to a large value while the UNDO tablespace is idle, which would occupy too many undo segments in advance.

Set _HIGHTHRESHOLD_UNDORETENTION to let Oracle keep auto-tuning undo retention while giving it a ceiling, so that it is not unduly influenced by MAXQUERYLEN.
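A diagnostic sketch for the undo scenarios above, using standard dictionary views (interpret per-instance figures with care on RAC):

```sql
-- How many undo segments are currently online vs. offline.
SELECT status, COUNT(*)
  FROM dba_rollback_segs
 GROUP BY status;

-- Recent undo consumption, longest query length, and out-of-space errors,
-- one row per 10-minute interval.
SELECT begin_time, undoblks, maxquerylen, nospaceerrcnt
  FROM v$undostat
 ORDER BY begin_time DESC;
```

A high MAXQUERYLEN combined with many offline segments before a traffic spike is the pattern that the measures above are meant to address.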



Important note in this article:



The hidden parameters above are introduced partly to deepen understanding of the related Oracle management mechanisms. Before enabling any of them in production, obtain confirmation and support from Oracle; and once the peak has passed or the urgent problem is solved, be sure to revert the hidden parameter.



Do not use hidden parameters casually in a production environment; this is the most basic principle of database operations!



Summary

The solutions above treat the symptoms rather than the root cause. These optimizations may help your system survive the current peak, but as time passes and a larger peak arrives, the problems will recur. Optimizing the "demand placed on the database" is always more effective than optimizing the "resources the database can supply"; although the former sometimes costs more, the investment is generally proportional to the return. In this sense, the fundamental cure is for the application to control concurrency reasonably, introduce a cache layer into the architecture, adopt asynchronous queue processing, and optimize the data model design and SQL.



This article is from the Yunqi community partner "DBAplus". Original release time: 2016-10-08.

