Service Case | CIS database failure

1. Alarm notification

In the early morning of December 20, the platform received an alarm from a tertiary hospital: the log file of tempdb, on the SQL Server instance behind the core business CIS system, was running out of free space.

The alarm details page shows that tempdb log file usage began to climb abnormally after 1:30 am.

A quick tip

tempdb is a system database of the SQL Server instance and a temporary resource shared by everything running in that instance. It is rebuilt every time the server restarts, so unlike other databases it cannot store data permanently. In other words, tempdb is scratch space that holds intermediate data for the requests running in the instance. Normally that space is released automatically once a task finishes, so the database is not expected to keep growing.

2. Problem handling

The MOC notified the on-site engineer that the tempdb log file on the SQL Server instance had grown abnormally, that usage had reached 99%, and that the file was about to fill up. A full log could prevent temporary tables from being created or transactions from being committed, so a data engineer needed to be contacted to handle it.

Because the alarm concerned only the tempdb log file, the data engineer did not give it enough attention and did not handle it promptly, so the alarm persisted.

The alarm was still active at 16:30, shortly before the engineer was due to leave work. The MOC engineer observed that used space on the database server's C drive was growing in step with the tempdb log file: the log file had grown from 0 to 28.52 GB, while free space on the C drive had dropped from 45 GB to 12 GB. The tempdb log file was still growing. On that trend, the C drive of the CIS server would be full between 23:00 and 24:00 that night, which could crash the CIS system. CIS is the hospital's core system, and an outage would cause serious, hard-to-quantify losses.

In short, the log file and C drive usage were growing together, and the C drive would eventually fill up. On this trend, an outage late that night would have been no surprise.
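The trend estimate can be approximated with a simple linear extrapolation. The sketch below is only an illustration: it assumes growth started at 1:30 and was measured at 16:30 (15 hours), uses the 28.52 GB of log growth and 12 GB of free space quoted above, and uses hypothetical function names. The platform's own trend model is not described in this case, so the result need not match its 23:00 to 24:00 estimate exactly.

```python
def hours_until_full(grown_gb: float, elapsed_hours: float, free_gb: float) -> float:
    """Hours until free space reaches zero, assuming constant growth."""
    if grown_gb <= 0 or elapsed_hours <= 0:
        raise ValueError("need positive growth over a positive interval")
    growth_rate = grown_gb / elapsed_hours  # GB per hour
    return free_gb / growth_rate


def clock_time(start_hour: float, offset_hours: float) -> str:
    """Format start_hour + offset_hours as HH:MM on a 24-hour clock."""
    total_minutes = round((start_hour + offset_hours) * 60) % (24 * 60)
    return f"{total_minutes // 60:02d}:{total_minutes % 60:02d}"


# Figures from the case: 28.52 GB of growth between 1:30 and 16:30,
# with 12 GB still free on the C drive at 16:30.
remaining = hours_until_full(grown_gb=28.52, elapsed_hours=15.0, free_gb=12.0)
projected = clock_time(16.5, remaining)
print(f"about {remaining:.1f} hours left, disk full around {projected}")
```

A straight-line projection like this lands late in the evening, which is why acting before the end of the working day mattered.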

The MOC contacted the user's engineers and recommended dealing with the growing tempdb log file before they left work. An outage at night could still be resolved, but dealing with the problem now would cost far less.

Once the data engineer learned that the C drive would fill up that night and might cause an outage, he acted immediately: he stopped the running tasks on the SQL Server instance, shrank the tempdb log file, and freed up space on the C drive, which cleared the alarm.
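The case does not record which commands the engineer ran. As a rough sketch, the handling above might map to the following sequence of standard SQL Server statements, kept here as a Python checklist so it can be reviewed before anything is executed. The logical file name `templog` is the SQL Server default for the tempdb log and the 1024 MB target size is a hypothetical value; both are assumptions about this environment.

```python
# Each entry is (purpose, T-SQL text). These are standard SQL Server commands,
# but this ordering is an illustration, not the engineer's recorded procedure.
TEMPDB_LOG_STEPS = [
    ("Check log size and percent used for every database",
     "DBCC SQLPERF(LOGSPACE);"),
    ("See what is preventing the tempdb log from being reused",
     "SELECT name, log_reuse_wait_desc FROM sys.databases WHERE name = 'tempdb';"),
    ("Find the oldest active transaction in tempdb",
     "USE tempdb; DBCC OPENTRAN;"),
    # Assumes the default logical name 'templog'; 1024 is a target size in MB.
    ("After the offending task is stopped, shrink the log file",
     "USE tempdb; DBCC SHRINKFILE (templog, 1024);"),
]


def render_script(steps):
    """Join the steps into one commented T-SQL script for review."""
    lines = []
    for number, (purpose, statement) in enumerate(steps, start=1):
        lines.append(f"-- Step {number}: {purpose}")
        lines.append(statement)
        lines.append("GO")  # batch separator used by sqlcmd and SSMS
    return "\n".join(lines)


print(render_script(TEMPDB_LOG_STEPS))
```

The shrink comes last on purpose: a log file can only shrink once the long-running task holding the log open has been stopped, which matches the order of actions described above.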

3. Summary of the problem

tempdb is a temporary database that handles intermediate data for the requests running in the instance. Its space is normally released automatically once a task completes, which is why the alarm did not get enough attention when the platform first notified the data engineer. LinkSLA's online engineers kept following up and found that the tempdb log file was still growing and was eating into the operating system's disk space. According to the platform's trend calculation, the CIS system would have gone down that night.

As the saying goes, the best warriors win no spectacular victories. The same holds in operation and maintenance: avoid big risks and let small wins add up to big ones. Start with the details, nip problems in the bud, and do not ignore low-severity incidents. Small problems that are delayed or allowed to accumulate can also lead to major, system-wide outages.

A system runs normally only when all of its hardware resources cooperate under the system's instructions. Comprehensive data monitoring combined with real-time automated inspection makes it possible to detect problems promptly, respond to them actively, and eliminate them precisely before they escalate.

LinkSLA steward-style operation and maintenance service

The LinkSLA intelligent operation and maintenance manager is more than a tool: it takes a substantive part in the user's proactive and preventive operation and maintenance process.

1. 7*24 online duty

MOC engineers monitor platform alarms online in real time. After screening and preliminary fault location, they create work orders to notify the user's engineers. Handling work orders in a closed loop reduces the workload of the user's engineers and filters out invalid alarms and work orders.

▲7*24 online, closed loop of work orders

2. Full stack monitoring

Unified monitoring of equipment, system software, application software, and security logs.

▲Full stack monitoring

3. Machine learning algorithm to achieve accurate alarms.

Unlike traditional static-threshold alarms, machine learning algorithms are trained on historical data to detect deviations from normal business operation, which greatly improves alarm accuracy.
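The difference between a static threshold and a baseline learned from history can be shown with a toy example. The sketch below uses a plain z-score over past readings as a stand-in for a trained model. It is not LinkSLA's algorithm, and the history values, the 90% threshold, and the function names are all hypothetical.

```python
from statistics import mean, stdev


def static_alarm(value: float, threshold: float = 90.0) -> bool:
    """Fire only when usage crosses a fixed percentage."""
    return value >= threshold


def baseline_alarm(history: list[float], value: float, z_limit: float = 3.0) -> bool:
    """Fire when the value deviates strongly from what history says is normal."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_limit


# Hypothetical tempdb log usage (%) at the same hour on earlier, normal nights.
history = [2.0, 3.1, 2.4, 2.8, 3.0, 2.2, 2.6]

# A reading of 35% is far below a 90% static threshold but far above the baseline.
print(static_alarm(35.0))             # False
print(baseline_alarm(history, 35.0))  # True
```

With a baseline, abnormal growth is flagged while there is still plenty of headroom, instead of only when the file is nearly full.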

▲AI machine learning algorithm alarm details

4. Real-time inspection and accurate detection of the real-time status of the system.

Origin blog.csdn.net/LinkSLA/article/details/135222715