Case sharing | Abnormal CPU monitoring

CPU usage is one of the most important monitoring indicators because it reflects the overall load on a system. It has a direct impact on the performance of business systems, and CPU usage data provides the basis for further analysis and tuning of the system or application.

At 22:00 on April 25, the platform received an alarm: the CPU usage of the HIS database server at a county-level hospital had exceeded the threshold. Usage stood at 99%, far above the preset threshold.

Warning message

April 25th

The CPU usage of the HIS database exceeds the threshold.

Event lasts 1 hour and 30 minutes.
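The alarm above pairs a threshold breach with an event duration. The following is a minimal sketch of how such an event could be derived from periodic CPU samples. The 90% threshold, the 10-minute sampling interval, the year, and the function name `find_threshold_events` are all assumptions for illustration. The article does not state how the platform computes the duration.

```python
from datetime import datetime, timedelta


def find_threshold_events(samples, threshold):
    """Group consecutive samples above `threshold` into (start, end) events.

    `samples` is a list of (timestamp, cpu_percent) tuples sorted by time.
    """
    events = []
    start = None
    last_above = None
    for ts, cpu in samples:
        if cpu > threshold:
            if start is None:
                start = ts  # first sample above the threshold opens an event
            last_above = ts
        elif start is not None:
            # The first sample back under the threshold closes the event.
            events.append((start, ts))
            start = None
    if start is not None:
        # Still above the threshold at the last sample: report what we have.
        events.append((start, last_above))
    return events


# Hypothetical data: 40% before 22:00, 99% from 22:00 to 23:20, 35% at 23:30.
t0 = datetime(2023, 4, 25, 21, 50)  # the year is assumed
values = [40] + [99] * 9 + [35]
samples = [(t0 + timedelta(minutes=10 * i), v) for i, v in enumerate(values)]

events = find_threshold_events(samples, threshold=90)
for begin, end in events:
    print(begin, "->", end, "duration", end - begin)
```

With this sample data the single event runs from 22:00 to 23:30, which matches the 1 hour and 30 minutes reported in the alarm.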

Handling process

The MOC engineer notified the customer's on-site engineer and recommended using Task Manager to identify the processes occupying CPU resources and to check the CPU consumption of the application processes.

Because the customer did not check which processes were occupying CPU resources at the time of the incident, the unnecessary running process was not identified and the problem was not solved in time. The MOC on-duty engineer continued to follow up and sent the HIS database alarm record to the customer again as a reminder.

The HIS system is one of the hospital's more important systems. It must handle many tasks concurrently and keep the database running for long periods, so it places particularly high demands on the CPU's floating-point computing capability. After the second reminder, the customer took the alarm seriously, in order to prevent the system from slowing down or even going down because of high CPU usage.

On the advice of LinkSLA online experts, the customer's engineer ran Process Explorer to observe the CPU usage of each thread in the oracle.exe process and determine which threads were consuming CPU.

Through this monitoring, the customer identified the SQL statement that was consuming a large amount of CPU resources. After the statement was optimized, the problem was solved.

Case summary

The HIS system places high demands on the CPU's floating-point computing capability. When its CPU usage reaches 99%, the system runs very slowly, but after a while the CPU usage returns to normal and the system runs at normal speed again. This pattern can easily lull customers into complacency, so that underlying problems in the system are ignored.

The HIS system is the core system of the hospital, and downtime would seriously affect hospital business. The LinkSLA intelligent operation and maintenance platform provides early warning, fast fault location, and tracking through to resolution, which helps avoid business interruption and keeps the system running healthily.

In day-to-day operation and maintenance, to keep systems running stably, CPU monitoring should also cover the following indicators.

CPU monitoring indicators

Windows operating system monitoring indicators

CPU idle time percentage

Percentage of interrupt CPU time

Percentage of privileged mode CPU time

Percentage of non-idle thread CPU time

Linux operating system monitoring indicators

Idle CPU ratio

CPU waiting for IO time ratio

System CPU percentage

User CPU Percentage
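On Linux, these four ratios can be derived from the aggregate `cpu` line of `/proc/stat`, whose counters are cumulative, so the ratio over an interval comes from the difference between two readings. The sketch below is an illustration only, not the platform's collector. The function names are hypothetical and the two sample lines are made-up values.

```python
# Field order of the aggregate "cpu" line in /proc/stat (first eight counters).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn a 'cpu  u n s i w q sq st ...' line into a dict of counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    values += [0] * (len(FIELDS) - len(values))  # older kernels expose fewer fields
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Return idle, iowait, system and user percentages between two readings."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    delta = {name: b[name] - a[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("the two readings must be taken at different times")
    return {
        "idle": 100.0 * delta["idle"] / total,
        "iowait": 100.0 * delta["iowait"] / total,
        "system": 100.0 * delta["system"] / total,  # irq/softirq not included here
        "user": 100.0 * (delta["user"] + delta["nice"]) / total,
    }


# Made-up readings taken some seconds apart.
before = "cpu  1000 0 500 8000 400 50 50 0 0 0"
after = "cpu  1600 0 700 8100 480 60 60 0 0 0"

usage = cpu_percentages(before, after)
print(usage)
```

On a real host, the two strings would be the first line of `/proc/stat` read twice with a short pause in between. Counting `nice` time as user time and keeping interrupt time out of `system` are choices made for this sketch, and monitoring tools differ on both.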

The platform uses machine learning algorithms for anomaly detection and monitors the status of operating system indicators through agents, SNMP, and similar mechanisms. For indicators with strong periodicity, such as CPU usage, the user is notified promptly when an anomaly is found, and follow-up continues until the problem is solved.

LinkSLA's intelligent operation and maintenance housekeeper breaks down operation and maintenance silos by monitoring the whole link. It provides users with efficient and cost-effective operation and maintenance services, deals with server component problems in a timely manner, helps avoid downtime or even data loss caused by unexpected failures, and keeps business systems running healthily and stably.
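The article does not describe the platform's algorithm. As a much simpler stand-in that shows why periodicity matters, the sketch below compares a new CPU reading with the historical readings from the same slot of the cycle (the same hour of the day here), using the median and the median absolute deviation. The function name, the parameters `k` and `min_mad`, and the data are all hypothetical.

```python
from statistics import median


def is_anomalous(history, slot, value, period=24, k=3.0, min_mad=1.0):
    """Flag `value` if it deviates strongly from past readings in the same slot.

    `history` holds evenly spaced readings, oldest first, starting at slot 0.
    `slot` is the position within the cycle (hour of day when period=24).
    """
    same_slot = history[slot::period]  # readings from this slot in each past cycle
    if len(same_slot) < 3:
        raise ValueError("need at least three past cycles for a baseline")
    center = median(same_slot)
    mad = median(abs(x - center) for x in same_slot)
    # A floor on the spread keeps very flat histories from flagging tiny changes.
    return abs(value - center) > k * max(mad, min_mad)


def daily_profile(hour):
    """Made-up load pattern: busy during working hours, quiet otherwise."""
    return 60.0 if 8 <= hour < 18 else 30.0


# Seven days of hourly readings with a little day-to-day variation.
history = [daily_profile(h) + (day % 3) for day in range(7) for h in range(24)]

print(is_anomalous(history, 22, 99.0))  # 99% at 22:00
print(is_anomalous(history, 22, 60.0))  # daytime-level load at 22:00
print(is_anomalous(history, 10, 60.0))  # the same load at 10:00
```

In this made-up data, a reading of 60% is flagged at 22:00 but not at 10:00, which a single fixed threshold cannot express.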

Origin blog.csdn.net/LinkSLA/article/details/130521273