Oracle performance problem analysis: database IO slowdown caused by a storage fiber switch failure

Background:

The person in charge of the project reported to me that, in the early hours of this morning, a job in the system ran 30 minutes slower than usual, and asked me to diagnose the problem.

Problem:

Why did the system's job run take 30 minutes longer than usual?

Main idea:

What you see is not necessarily the truth or the whole story; it may be only symptoms, or just clues.

To get at the truth and the full picture, you need to be patient and carry out a systematic, scientific analysis.

Either cover everything, or clearly separate the primary issues from the secondary ones!

Methodology:

The problem is analyzed along three dimensions:

1) Make the user's perception concrete: time of occurrence, affected system, event description (unavailable? usable but slow? how slow?)

2) Data collection: AWR reports, OS resource usage, database and OS logs, application logs

3) Data analysis: comparative analysis, symptom analysis, step-by-step drill-down (recursive cause-and-effect reasoning).

Troubleshooting process:

1) Make the user's perception concrete

1.1 Time of occurrence: 01:30 to 04:00 in the early morning

1.2 Affected system: the X computing system

1.3 Event description: the calculation took 30 minutes longer than in the previous cycle; the data volume showed no obvious change, and the program code had not been changed at all.

2) Data collection:

2.1 AWR report: generate a report covering 01:00 to 04:00 (see the sketch after this list).

2.2 OSWatcher: generate reports covering 01:00 to 04:00.

2.3 Collect the database, OS, and application logs.
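
For step 2.1, a minimal sketch of locating the AWR snapshots that bracket the 01:00-04:00 window and then generating the report; the snapshot interval and instance numbering are assumptions that depend on the environment:

```sql
-- Find the snapshots that bracket the 01:00-04:00 problem window (last 24 hours shown)
SELECT snap_id, instance_number,
       TO_CHAR(begin_interval_time, 'YYYY-MM-DD HH24:MI') AS begin_time
FROM   dba_hist_snapshot
WHERE  begin_interval_time > SYSDATE - 1
AND    EXTRACT(HOUR FROM begin_interval_time) BETWEEN 1 AND 4
ORDER  BY snap_id, instance_number;

-- Then generate the report with the standard interactive script:
-- SQL> @?/rdbms/admin/awrrpt.sql
```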

3) Data analysis:

3.1 A common quick-triage method, symptom analysis:

a) Check DB Time, DB CPU, the load profile, the top events, and the top SQL in the AWR report, and identify the key symptom among the top events:

| Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
|---|---|---|---|---|---|
| DB CPU | | 28,103 | | 47.79 | |
| log file sync | 787 | 3,187 | 4049 | 5.42 | Commit |
| log buffer space | 108 | 3,026 | 28022 | 5.15 | Configuration |
| log file switch completion | 816 | 904 | 1108 | 1.54 | Configuration |
| log file switch (checkpoint incomplete) | 52 | 767 | 14745 | 1.30 | Configuration |
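
As a cross-check outside the AWR report, the top timed events can also be spot-checked live; a small sketch (note these counters are cumulative since instance startup, not restricted to the problem window):

```sql
-- Top 10 non-idle wait events since instance startup
SELECT *
FROM  (SELECT event,
              total_waits,
              ROUND(time_waited_micro / 1e6)                               AS time_waited_s,
              ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1000, 1)  AS avg_wait_ms,
              wait_class
       FROM   v$system_event
       WHERE  wait_class <> 'Idle'
       ORDER  BY time_waited_micro DESC)
WHERE  ROWNUM <= 10;
```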

b) Look at the average wait times: the longest is 28 seconds, and log file sync averages about 4 seconds, which is highly suspicious. Since our database storage moved to all-flash arrays, the normal avg wait (ms) has been within 1 ms, often too short even to be captured. <<<<<<<< comparative analysis
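
To quantify how far log file sync has drifted from its usual sub-millisecond profile, the wait-time distribution is a quick comparison point; a minimal sketch using V$EVENT_HISTOGRAM (on RAC, GV$EVENT_HISTOGRAM shows each node):

```sql
-- Distribution of 'log file sync' waits since instance startup;
-- on healthy all-flash storage virtually all waits land in the <= 1 ms bucket.
SELECT event, wait_time_milli, wait_count
FROM   v$event_histogram
WHERE  event = 'log file sync'
ORDER  BY wait_time_milli;
```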

c) Think: under what circumstances do these wait events occur, and with such long average wait times?

c-1: Conventional lines of thinking:

The redo logfiles are not big enough >>> further checking finds about 70 log switches per hour >>> but log file parallel write avg wait is only 1 ms
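
A minimal sketch of how a log-switch rate like the 70-per-hour figure above can be confirmed from V$LOG_HISTORY:

```sql
-- Redo log switches per hour over the last day
SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour,
       COUNT(*)                               AS switches
FROM   v$log_history
WHERE  first_time > SYSDATE - 1
GROUP  BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
ORDER  BY 1;
```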

The log buffer is not big enough >>> redo generation is about 22 MB per second, and the current log buffer is far smaller than 22 MB * 600.
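
A rough sketch of how the redo generation rate and the current log buffer setting can be read from the instance (the rate below is a crude average since startup, not the peak rate during the batch window):

```sql
-- Approximate average redo generation rate since startup (bytes per second)
SELECT ROUND(s.value / ((SYSDATE - i.startup_time) * 86400)) AS redo_bytes_per_sec
FROM   v$sysstat s, v$instance i
WHERE  s.name = 'redo size';

-- Current log buffer size (bytes)
SELECT value FROM v$parameter WHERE name = 'log_buffer';
```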

(1) Commits are too frequent >>>> check the application code;

(2) IO has become slow >>> check the IO Stats section of the AWR report; in the Tablespace IO Stats section, the SYSTEM tablespace shows an average buffer wait (Av Buf Wt) of about 200 ms >>> why is SYSTEM waiting so badly while nothing else stands out? >>> a large volume of IO writes to SYSTEM could explain it, but the data shows only a little over 900 write requests, so that is obviously not the case >>> SYSTEM sits in the same disk group as the other tablespaces, so why is the wait so pronounced only on SYSTEM? >>> clearly the contention is not caused by reads and writes from the other tablespaces, so could another node be responsible? >>> checking the other nodes over the same period shows similar wait events, but comparing the periods with a generated awrddrpt report shows no additional SQL >>> so what is the cause? >>> bring in other data to assist the analysis (see the query sketch after this item).
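
A minimal sketch of cross-checking per-tablespace IO latency outside the AWR report, using the classic V$FILESTAT counters (READTIM/WRITETIM are in centiseconds, hence the *10 to get milliseconds); on RAC the GV$ views give the per-node picture:

```sql
-- Per-tablespace physical IO volume and average latency since instance startup
SELECT ts.name                                                        AS tablespace,
       SUM(fs.phyrds)                                                 AS phys_reads,
       SUM(fs.phywrts)                                                AS phys_writes,
       ROUND(SUM(fs.readtim)  * 10 / NULLIF(SUM(fs.phyrds), 0), 1)   AS avg_read_ms,
       ROUND(SUM(fs.writetim) * 10 / NULLIF(SUM(fs.phywrts), 0), 1)  AS avg_write_ms
FROM   v$filestat   fs
JOIN   v$datafile   df ON df.file# = fs.file#
JOIN   v$tablespace ts ON ts.ts#   = df.ts#
GROUP  BY ts.name
ORDER  BY avg_write_ms DESC NULLS LAST;
```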

d) Analyze the database alert log: no obvious ORA- errors were found. Analyze the OS log (we did not have permission to view the OS logs at the time, so this part of the analysis was not done properly); in fact, a multipath error would likely have been visible there (or in the logs of the multipathing software on the storage side). Later, with the storage administrator's assistance, one of the fiber optic switches was indeed found to be damaged. When the database issues an IO request and the OS dispatches it down the path through the damaged switch, the IO response slows down: the switch port still shows as up, but it cannot actually forward the data to the storage. When forwarding of the data sent by the OS fails within the specified time, the IO is resent down the other path, so the data in the database remains intact, but the overall response time slows down.

At this point the truth had surfaced... In fact, before it did, I took quite a few detours.

Solution:

1. The two storage fiber optic switches form a high-availability pair with round-robin (polling) path selection. Only one of them is damaged and the other is normal, so the damaged switch's port was closed; the OS then automatically routes the database's IO requests through the healthy switch and on to the storage.

This was verified in testing and then implemented in production.

2. Notify the vendor to replace the damaged hardware.

Supplement: my own work in this incident was not without shortcomings. I did not check the OS and the storage paths, which was a major oversight. I hope to take this incident as a warning.


Source: blog.csdn.net/oradbm/article/details/109031412