In just 2 months, alarms were reduced by 65%. What did this company do right?

User story

Mr. Liu heads the information department of a company in Zhejiang. Many years ago he rolled out a well-known domestic network management and operation and maintenance software product. "It was a failed project. Our operation and maintenance engineers suffer from alarm storms every day, and important alarms are drowned out by a flood of invalid ones. You have to understand that the company spends a lot of time handling each alarm, and handling them carries a large risk cost."

Mr. Liu asked his engineer whether they could try another operation and maintenance platform. The engineer told him that other products were much the same, and that apart from adding front-line staff there seemed to be no good way to handle every alarm promptly the way banks and telecom operators do. Later, Mr. Liu saw a case study pushed by LinkSLA in which a customer using its butler-style operation and maintenance service saw business system failures drop from 20 a year to zero. Treating it as a low-risk trial, he arranged for his engineers to subscribe for two months.

"Why choose LinkSLA? Because of the SaaS subscription model. If it doesn't meet my expectations, I simply don't renew. The cost is small and I don't lose much." Mr. Liu is a little proud. "But the results exceeded my expectations. I can tell you clearly that our invalid alarms are now down by 65% and MTTR is down by 30%."

In operation and maintenance work, alarm management is a very important step. It not only greatly improves operation and maintenance efficiency, but also helps enterprises establish a sound event management process and keeps business systems running healthily and stably. To improve alarm accuracy, LinkSLA has done a lot of work on alarm rules: AI rules, alarm aggregation, multi-condition combination alarms, alarm dependencies, rule-based masking, masking by time period, and masking by child object. Just as important, MOC engineers review the alarms, discard the false ones, keep the real ones, and adjust the rules, so that every alarm a user is notified of is real and valid.
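To make a few of these rule types concrete, here is a minimal sketch of time-period masking, dependency suppression, and aggregation. It is not LinkSLA's implementation; the `Alarm` and `MaskRule` types, the `"down"` metric name, and the `depends_on` mapping are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Alarm:
    host: str
    metric: str
    raised_at: datetime


@dataclass(frozen=True)
class MaskRule:
    """Hypothetical rule: mask one metric on one host in a daily time window."""
    host: str
    metric: str
    start: time
    end: time

    def matches(self, alarm: Alarm) -> bool:
        t = alarm.raised_at.time()
        if self.start <= self.end:
            in_window = self.start <= t < self.end
        else:  # the window wraps past midnight, e.g. 23:00 to 02:00
            in_window = t >= self.start or t < self.end
        return (alarm.host == self.host
                and alarm.metric == self.metric
                and in_window)


def filter_alarms(alarms, mask_rules, depends_on):
    """Drop masked alarms and alarms whose upstream host is already down.

    depends_on maps a host to the upstream host it depends on (assumed shape).
    """
    down_hosts = {a.host for a in alarms if a.metric == "down"}
    kept = []
    for a in alarms:
        if any(rule.matches(a) for rule in mask_rules):
            continue  # masked by a time-period rule
        if depends_on.get(a.host) in down_hosts:
            continue  # symptom of an upstream failure, not a separate fault
        kept.append(a)
    return kept


def aggregate(alarms):
    """Collapse repeated (host, metric) alarms into a single entry with a count."""
    counts = {}
    for a in alarms:
        key = (a.host, a.metric)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

In this sketch a nightly backup window could be masked with `MaskRule("db1", "cpu_high", time(1, 0), time(3, 0))`, and alarms from hosts behind a failed switch are dropped so that only the switch alarm reaches the engineer.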

LinkSLA service plan

1. AI machine learning to create the strongest alarm system

As the number of enterprise business systems grows, the number of monitored objects and indicators increases exponentially. Configuring static thresholds in the traditional way not only consumes a lot of manpower, but also easily produces inaccurate alarms. Machine learning frees operation and maintenance staff from tedious events and is more efficient when applied to anomaly alarms, alarm convergence, fault analysis, and trend prediction. But it is not easy to build an AI algorithm that stays highly accurate across a large number of real scenarios. LinkSLA was incubated at the School of Artificial Intelligence of Nanjing University. As early as four years ago, it worked with a professional team from Nanjing University to build a set of large AI models with practical value, and adopted an innovative "large model, small software" approach to keep the algorithm accurate across a variety of real customer deployments.

1. Build Adaptive Anomaly Detection

For indicators that are cyclical, trending, or seasonal, machine learning can build adaptive anomaly detection. For example, the CPU baseline during the day differs from the baseline at night, and the baseline in January differs from the one in June. Beyond single-indicator anomaly monitoring, multi-dimensional data analysis is also possible. For example, if the response time of a business system exceeds its normal range, the key indicators of its business components, such as CPU usage, memory usage, disk and network IO, and JVM usage, are analyzed with a decision-tree model that automatically determines the impact weight of each.
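The idea of a baseline that changes with the time of day can be sketched as follows. This is a simple per-hour mean and standard deviation check, swapped in for illustration only; it is not the model LinkSLA uses, and the function names and the `k` and `min_std` parameters are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(samples):
    """Build a per-hour baseline from (hour_of_day, value) history.

    Returns {hour: (mean, population standard deviation)}.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {h: (mean(v), pstdev(v)) for h, v in by_hour.items()}


def is_anomalous(baseline, hour, value, k=3.0, min_std=1.0):
    """Flag a value more than k standard deviations from that hour's mean.

    min_std keeps a perfectly flat history from flagging tiny fluctuations.
    """
    if hour not in baseline:
        return False  # no history for this hour, so nothing to compare against
    mu, sigma = baseline[hour]
    return abs(value - mu) > k * max(sigma, min_std)
```

With a history where CPU usage sits near 70% at 14:00 and near 10% at 03:00, a reading of 72% is normal in the afternoon but flagged at night, which a single static threshold cannot express.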

2. Capacity trend forecast

Forecast the trend of capacity indicators for user assets, such as file system space and database tablespace, and give early warning based on the growth trend, so that users have enough time for data cleanup, expansion, or migration.
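A capacity forecast of this kind can be approximated with a straight-line fit over recent usage. The sketch below uses ordinary least squares and assumes roughly linear growth; the function names and the daily sampling are illustrative assumptions, and a production forecaster would likely need something more robust.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def days_until_full(days, used_gb, capacity_gb):
    """Estimate days remaining after the last sample until capacity is reached.

    Returns None when usage is flat or shrinking, since no fill date exists.
    """
    slope, intercept = fit_line(days, used_gb)
    if slope <= 0:
        return None
    full_day = (capacity_gb - intercept) / slope
    return max(0.0, full_day - days[-1])


def should_warn(days, used_gb, capacity_gb, lead_days):
    """Warn when the projected fill date falls within the required lead time."""
    remaining = days_until_full(days, used_gb, capacity_gb)
    return remaining is not None and remaining <= lead_days
```

For a file system growing 10 GB a day from 500 GB toward a 600 GB limit, five daily samples project about six days of headroom, so a 7-day lead time would already trigger a warning.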

3. Provide a visual unified interface

Provide operation and maintenance personnel with a unified visual interface offering intelligent anomaly alarms, alarm confirmation based on dynamic thresholds, and anomaly detection across massive time-series indicators, helping them quickly identify and predict possible problems. AI machine learning algorithms analyze the root cause of a problem, greatly improving alarm accuracy and operation and maintenance efficiency.

2. Platform + service, on-duty operation and maintenance platform

Technological innovation has eliminated most false alarms. What LinkSLA delivers to end customers is not just an operation and maintenance software platform, but a "housekeeper-style" monitoring service that combines the platform with on-duty handling of alarms and work orders. On-duty engineers working 7×24 in the background proactively receive work orders for users, coordinate and process them, and track and supervise the whole process, forming a closed-loop service online and offline.

Case Studies

During the holidays, the on-duty engineer repeatedly received an early-morning work order reporting that a customer's core file system was full. Under the SLA, the customer was not to be notified until working hours in the morning, and by then the alarm had cleared on its own. This went on for several days.

The careful MOC duty engineer checked the historical data and found a pattern: the 1 TB of space filled up completely every day in the early morning, and around 9 o'clock about 400 GB was released. The engineer checked the related disk capacity, disk IO, application process, and other data, and concluded that the customer was running backups during this period. After confirming this with the customer, the engineer analyzed the backup logs and found that backups often failed for lack of space, while the customer believed the core business system data had been backed up. The MOC engineers immediately worked with the customer to adjust the backup plan, and the problem was completely solved.

The three elements of operation and maintenance are "people, tools, and processes." Most customers have only on-site or response personnel. Apart from very large customers such as major banks and telecom operators, few have front-line on-duty engineers who are the first to handle alarms and work orders. The result is often "passive", "fire-fighting" handling, in which early warnings and hidden dangers in business systems, data, network security, and so on cannot be discovered and eliminated before a disaster occurs. LinkSLA's "platform + on-duty service" model can truly become the user's "operation and maintenance steward", providing proactive services that "eliminate hidden dangers in advance" and "handle hidden dangers or faults in a timely manner."

3. Eliminate false alarms and reduce operation and maintenance costs

In practice, an operation and maintenance monitoring service needs to tell operation and maintenance personnel simply, efficiently, and accurately where there are hidden dangers or faults to deal with. LinkSLA Intelligent Operation and Maintenance Manager starts from the user's needs. It first uses field-tested technological innovations such as AI to eliminate most "false alarms", and then adds the "alarm and work order on-duty service" to take over the most tedious and hardest-to-staff part of the operation and maintenance process. Operation and maintenance thereby becomes "proactive" work with a "clear process".

In addition, the on-duty service covers not only first-line handling of alarms and work orders, but also online support from second-line engineers and highly experienced industry experts, which can greatly improve and speed up users' ability to analyze and solve problems.


Origin blog.csdn.net/LinkSLA/article/details/132451421