The feasibility of ensuring system stability amid operation and maintenance difficulties

The dilemma behind rapid business development

As the business grows rapidly, the operation and maintenance system gradually matures. Supported by monitoring, availability measurement, and related systems, business stability and service quality improve steadily. Problems, faults, and other factors affecting stability stay within a controllable, converging range, and everything appears to be heading in a good direction.

Is everything really as rosy as it looks? Not quite. Rapid business growth inevitably leaves hidden risks and problems behind. Ask yourself whether any of the following sound familiar:

  • 1. Monitoring alarm notifications are too noisy: the alarm channel is clogged with low-value messages, and very few alarms actually get read. Imagine a critical business alarm being buried in that noise and ignored. As an operation and maintenance engineer, would that not leave you in a cold sweat? Beyond layering the monitoring and pruning alarms, what else can we do?
  • 2. Business data is abnormal, yet alarms and availability data look normal the whole time. When business colleagues ask "Why didn't monitoring catch this?", all we can say is "Sorry, we will do better next time." What else can we do?
  • 3. Alarms fire and availability fluctuates abnormally, but business indicators show no significant change. Fixing the underlying issue would clearly benefit the business, yet business colleagues neither understand nor support the effort. Besides feeling powerless, what else can we do?

Where is the problem

With these questions in mind, let's go through them one by one and look for the essence behind each.

Why are there so many monitoring alarms? The root cause is that we place monitoring points widely, aim for high coverage, and keep "plugging gaps" so that a business anomaly is never missed for lack of a monitoring point.

The intention is good, but the result often backfires. As the number of monitoring points and the complexity of the business keep growing, the noise generated by monitoring alarms grows with them. Once the volume of alarm information passes a critical point, every alarm becomes noise, or even pollution. At that point the monitoring and alarm system stops serving its purpose and, like a row of falling dominoes, collapses toward the opposite of what was intended.

Do business colleagues actually recognize the value of monitoring a large number of technical indicators? In practice, the picture is not encouraging. When operation and maintenance staff and business colleagues sit down to align on an issue, they often talk past each other, and neither side understands what the other is saying.

This may be the root of the problem. Does the extensive monitoring we do actually help stabilize and improve business indicators?

In particular, situations 2 and 3 above arise because operation and maintenance staff and business colleagues are not working in the same context: one side thinks in terms of business data, the other in terms of technical data.

Is this seemingly irreconcilable contradiction unsolvable? Of course not. The "Business Dashboard" emerged from exactly this situation. It is not just a tool, a report, or a platform. It is a technology-driven way of thinking built on key business indicators, which lets operation and maintenance, business, and other parties communicate in the same context.

Solution to the problem

First, operation and maintenance staff need to change their mindset and look at issues from the business side's perspective. Set technical indicators aside for the moment and talk with business colleagues to find out which indicators they care about most.

  • For a web business, business colleagues may care most about UV (unique visitors), PV (page views), homepage load time, and so on;
  • For an e-commerce business, they may care most about transaction conversion rate, transaction success rate, and so on;
  • For a distribution business, they may care most about download conversion rate, next-day retention rate, and so on.

After identifying a set of key indicators, narrow it down to the 1 to 3 most critical ones. Why narrow it down again?

Because the key, core path of the business is what matters most. Trying to watch every indicator means that none of them gets enough attention.

With the key indicators settled, we build them out following the same approach used for the availability system. Beyond the key business indicators themselves, we also analyze them along the following dimensions:

  • Baseline and range: a preset baseline value and fluctuation threshold for each key business indicator. Fluctuations around the baseline that stay within the threshold are expected and normal; anything outside the threshold range is an anomaly.
  • Period-on-period: compares a key business indicator in the current time period with the immediately preceding one. For example, compare the result at 17:22 with the result at 17:21. If the change is within the threshold range, it is normal; otherwise it is abnormal.
  • Year-on-year: compares a key business indicator at the same point in time in two different periods. For example, compare the result at 17:01 on April 25 with the result at 17:01 on April 24. If the change is within the threshold range, it is normal; otherwise it is abnormal.
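
The three dimensions above can be sketched in a few lines of Python. This is only an illustrative sketch, not a prescribed implementation: the function names, the sample transaction success rates, and the threshold values (a tolerance of 1.0 around the baseline, a 1% maximum relative change) are all assumptions made for the example.

```python
def outside_baseline(value, baseline, tolerance):
    """Return True when value falls outside baseline +/- tolerance."""
    return abs(value - baseline) > tolerance


def relative_change(current, reference):
    """Fractional change of current relative to reference."""
    if reference == 0:
        # Avoid division by zero: no change if both are zero, otherwise unbounded.
        return 0.0 if current == 0 else float("inf")
    return (current - reference) / abs(reference)


def exceeds_change(current, reference, max_change):
    """Return True when the relative change is larger than max_change."""
    return abs(relative_change(current, reference)) > max_change


# Hypothetical transaction success rates, in percent.
now = 97.2                  # e.g. 17:22 today
previous_minute = 99.1      # e.g. 17:21 today
same_time_yesterday = 99.0  # e.g. 17:22 yesterday

# Baseline and range: is the value outside 99.0 +/- 1.0?
baseline_anomaly = outside_baseline(now, baseline=99.0, tolerance=1.0)
# Period-on-period: did the value move more than 1% since the previous minute?
period_anomaly = exceeds_change(now, previous_minute, max_change=0.01)
# Same point in time, previous day: did the value move more than 1%?
same_period_anomaly = exceeds_change(now, same_time_yesterday, max_change=0.01)

print(baseline_anomaly, period_anomaly, same_period_anomaly)
```

In practice the baseline and thresholds would be agreed with business colleagues for each key indicator rather than hard-coded.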

To reduce false positives, the period-on-period and year-on-year comparisons can be combined with each other, and even with the baseline check.
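
One possible way to combine the checks is a simple vote: raise an alarm only when several dimensions agree that the indicator is abnormal. The article does not specify a combination rule, so the `should_alert` function, the `min_votes` parameter, and the check names below are hypothetical and serve only as a sketch.

```python
def should_alert(checks, min_votes=2):
    """Decide whether to alert from several anomaly checks.

    checks: mapping of check name -> bool (True means that check saw an anomaly).
    Returns (alert, flagged), where flagged lists the checks that fired.
    """
    flagged = [name for name, is_anomalous in checks.items() if is_anomalous]
    return len(flagged) >= min_votes, flagged


# A single-minute blip: only the period-on-period check fires, so no alarm is sent.
alert, flagged = should_alert(
    {"baseline": False, "period_on_period": True, "year_on_year": False}
)
print(alert, flagged)

# Two dimensions agree, so the alarm is sent.
alert, flagged = should_alert(
    {"baseline": True, "period_on_period": True, "year_on_year": False}
)
print(alert, flagged)
```

Requiring agreement trades some sensitivity for less noise. A stricter or looser `min_votes` is a judgment call that depends on how costly a missed anomaly is for the indicator in question.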

Closing thoughts

Because the "Business Dashboard" indicator data is built on core business indicators, it puts operation and maintenance staff and business colleagues in the same context more easily. Goals become clearer, problem-solving becomes more focused, and efficiency gains follow naturally.

Of course, only by continually aligning with business colleagues on the core indicators, and improving and optimizing them, can we keep enjoying the benefits the "Business Dashboard" brings.

Building on the "Business Dashboard", can we do more to further improve business stability? Stay tuned for the follow-up article, "Putting Operation and Maintenance Stability in Front of Business - Disaster Recovery Drill", which is planned for release soon.


