Failure causes: classification, prevention, and response measures

Every failure is a valuable learning opportunity.

Introduction

Failure is a sword hanging over every developer's head. As the saying goes: no zuo, no die. But developers can hardly avoid "zuo" altogether, so how do we keep it from turning into "die"?

Know yourself and know your enemy. To avoid failures, you need a thorough understanding of them.

A failure generally refers to a concentrated burst of problems over a period of time that causes a certain negative impact. A problem that affects only a small volume of business should not be counted as a failure; otherwise it blurs what a real failure is and diverts limited resources away from the key issues. A sporadic, non-concentrated problem is not necessarily a failure either: it may be a latent bug triggered by a low-probability event. It needs to be fixed, but calling it a failure is a stretch.

To avoid failures, you first need to understand why they occur. The classification and summary below come from an analysis of many failures.

Causes of failure

Common causes

Common causes are the most frequent sources of failure. Guarding against these situations prevents most failures.

Errors in core processes

When one step of a core process goes wrong, the overall process fails. A failure of the whole process, or of some business scenarios within it, produces a concentrated burst of problems. The usual cause is a piece of code added to the main flow that overlooks some scenario or lacks robustness, and so affects the whole.

Preventive measures:

  1. Assess every change point. This is very important. Even a one-line change, if it sits in the main flow, must have its scope of impact evaluated carefully. Be especially alert when the code added to the main flow is long.
  2. Add try-catch where necessary. If the added code only has a local effect, wrap it in try-catch so that an unforeseen exception cannot disrupt the overall process.
  3. Avoid casually changing generic methods and configuration that have global impact (the impact surface and the regression surface are very large). Where possible, add new code rather than modify existing code.
  4. Cover the core processes comprehensively with test cases, and require every release to pass this regression.
  5. Put risky changes behind a switch. If something goes wrong, turn the change off immediately.
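
Measures 2 and 5 can be combined in a few lines. The sketch below is only an illustration in Python (where try/except plays the role of try-catch). The names `FEATURE_SWITCHES`, `send_coupon`, and `place_order` are hypothetical, and a real switch would normally live in a configuration service rather than in a module-level dict.

```python
import logging

logger = logging.getLogger("order_flow")

# Hypothetical switch store; in practice this would come from a config service.
FEATURE_SWITCHES = {"send_coupon_after_order": True}


def send_coupon(order):
    """Hypothetical newly added, non-core step that may fail."""
    if order.get("buyer_id") is None:
        raise ValueError("order has no buyer_id")
    return "coupon-for-%s" % order["buyer_id"]


def place_order(order):
    """Core flow: the order must succeed even if the new step fails."""
    result = {"order_id": order["order_id"], "status": "CREATED", "coupon": None}

    # Risky change guarded by a switch: turn it off to disable the change at once.
    if FEATURE_SWITCHES.get("send_coupon_after_order", False):
        try:
            result["coupon"] = send_coupon(order)
        except Exception:
            # The local failure is logged and contained; the main flow continues.
            logger.exception("send_coupon failed for order %s", order["order_id"])

    return result
```

Catching a broad `Exception` is reasonable here only because the wrapped step is non-core and the error is logged; the core steps themselves should still fail loudly.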

Real cases:

Lack of robustness

Once the business logic is implemented, robustness is the first line of defense that keeps the service running smoothly and responding correctly to errors and exceptions. It is one of the essential coding skills of a qualified programmer.

Poor robustness easily lets an unexpected local detail, dirty data, or a failed local call affect the overall process and display.

Preventive measures:

  1. Think through errors and exceptions; the more you consider, the better.
  2. Make good use of try-catch as a safeguard.
  3. Use an empty string or an empty list instead of null.
  4. Cover the exception branches with tests.

Real cases:

  • A null value caused the entire order list to fail to load.
  • An error in a minor dependency caused the entire details page to fail to load.
  • Hook code contained a bug but was never tested; when the process took the exception branch, the task crashed outright, then restarted and crashed again repeatedly.

Instantaneous high traffic

Instantaneous high traffic is a major cause of failure. A traffic spike leads to a shortage of machine resources: CPU or memory usage soars or is exhausted, and connections are used up, which directly affects the stability of the whole service.

For message-processing applications, instantaneous high traffic delays message processing, so business state updates lag and downstream steps are affected. For non-messaging applications, it blocks tasks and makes interfaces slow or unresponsive.

Preventive measures:

  1. Clustered environments: make sure load is balanced across every machine or Region in the cluster.
  2. Single-machine environments: apply targeted flow limiting, rate limiting, and quantity limiting.
  3. Load-testing drills. Expand capacity according to the load-test results.
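
For the single-machine rate limit in measure 2, one common technique is a token bucket. The Python sketch below is a minimal, single-threaded illustration; the class name `TokenBucket` and its parameters are assumptions for this example, and production code would need locking or an existing rate-limiting component.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock          # Injectable so the limiter can be tested.
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A caller would check `allow()` before handling each request and reject or queue the request when it returns False.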

Real cases:

Extreme cases

An extreme case is a very rare event that pushes part of the system to its limits and causes the system to fail.

For example, an order usually contains no more than 10 kinds of products. If a merchant or buyer places orders in bulk and produces a large number of orders with more than 50 items each, an intensive export of those orders can cause serious Full GC in the application, so that interfaces time out or tasks cannot continue.

Preventive measures:

  1. Think through extreme situations and their impact.
  2. Design for and test extreme cases in advance.
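
One way to design for the oversized-order export described above is to process items in bounded batches, so that memory use per step does not grow with order size. This Python sketch is only an illustration; `MAX_ITEMS_PER_BATCH` and the `sku`/`qty` fields are hypothetical.

```python
MAX_ITEMS_PER_BATCH = 10  # Hypothetical limit derived from the "normal" case.


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def export_order_items(order_items):
    """Export item rows batch by batch; return (rows exported, batches used)."""
    exported = 0
    batches = 0
    for batch in chunked(order_items, MAX_ITEMS_PER_BATCH):
        rows = ["%s,%s" % (item["sku"], item["qty"]) for item in batch]
        exported += len(rows)  # A real exporter would write `rows` out here.
        batches += 1
    return exported, batches
```

A test for the extreme case then only needs to feed in an order far larger than the usual size and check that every row is still exported.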

Real cases:

Dependency failures

Dependency failures occur in the following situations:

  1. A service, configuration item, or variable that the application depends on does not exist or is the wrong version, so the application fails to start or the service cannot work normally after starting.
  2. When an underlying service becomes unstable and returns many errors, the applications that call it at high frequency also produce many errors, leading to an avalanche effect.

Preventive measures:

  1. When a project involves multiple systems or a release has many details, write a release document that carefully specifies the configuration and the release order, so that the application's dependencies are correct. During the release itself, strictly follow the checklist and the release order specified in that document. Check the dependencies: API versions, Jar versions, dependent services, configuration items, and DB fields.
  2. Automatic degradation. Strictly control timeouts, and isolate or remove unnecessary weak dependencies.
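
Timeout control and automatic degradation for a weak dependency can be sketched as a call wrapper that returns a fallback value when the dependency is slow or failing. The helper name `call_with_fallback` is hypothetical, and this Python version uses a thread pool only to keep the example self-contained; a real service would more likely rely on its RPC client's own timeout settings.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(func, timeout_seconds, fallback):
    """Run `func` with a time limit; return `fallback` on timeout or error."""
    future = _executor.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # cancel() only prevents the call from starting if it is still queued;
        # a call that is already running keeps occupying its worker thread.
        future.cancel()
        return fallback
    except Exception:
        return fallback
```

Because a timed-out call still holds a worker until it finishes, the pool size effectively caps how many slow calls can pile up, which is one simple form of isolation.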

Capital loss

Customer assets are highly sensitive private property. When a capital loss occurs, it is typically assigned the highest failure level.

Capital loss generally occurs in the following ways:

  1. Direct capital loss: the system does not handle idempotency, so a repeated message is processed several times.
  2. The funds business processes according to a status field returned by an underlying business service, and that service returns an incorrect status, so the amount is under- or over-calculated.
  3. Induced capital loss: displayed information induces the user into an action that is hard to reverse, for example showing an already shipped order as awaiting shipment.

Preventive measures: 1. In business that deals directly with funds, pay attention to idempotent processing; 2. Pay attention to the status fields that funds processing depends on; 3. Eliminate inducing information.
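
Idempotent processing usually means recording which messages have already been applied and ignoring repeats. The Python sketch below keeps that record in memory purely for illustration; `AccountService` and `handle_refund` are hypothetical names, and amounts are integers in the smallest currency unit.

```python
class AccountService:
    """Credits an account at most once per message id."""

    def __init__(self):
        self.balances = {}
        # In a real system this would be a unique key in the database,
        # written in the same transaction as the balance change.
        self.processed_ids = set()

    def handle_refund(self, message_id, account, amount):
        if message_id in self.processed_ids:
            return False  # Duplicate delivery: already applied, do nothing.
        self.balances[account] = self.balances.get(account, 0) + amount
        self.processed_ids.add(message_id)
        return True
```

The in-memory set is not safe across processes or restarts; the point of the sketch is only the check-then-apply-then-record shape.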

Errors in migrating from old to new

These occur during refactoring and technical optimization: for example, migrating an old domain model to a new one, an old technology stack to a new one, or an old page to a new one. During such a transformation, the focus is usually on testing the new functionality, and it is easy to overlook testing compatibility with the old functionality.

Migration involves a trade-off between compromise and thoroughness. The more thorough the migration, the greater the probability of errors and failures, but the cleaner the new system will be. Making some compromises with the old system reduces the number of errors and the probability of failure, but the new system goes live carrying the burden of the old one, and things may still go wrong later.

Preventive measures:

  1. Traffic splitting. Splitting traffic lets the impact surface grow gradually after the new service goes live, so that even if something was not considered, the impact is kept to a minimum.
  2. Test thoroughly. Work out good test cases in advance and execute them strictly.
  3. When an old interface is migrated to a new one, the structure and values of the return value should ideally stay consistent with the agreed contract. If they must change, assess the change carefully.
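
Traffic splitting is often implemented by hashing a stable key, such as a user id, into a bucket and comparing it with a rollout percentage. The Python sketch below assumes that approach; the function name `use_new_service` is hypothetical.

```python
import zlib


def use_new_service(user_id, rollout_percent):
    """Return True if this user falls inside the rollout percentage (0 to 100)."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    bucket = zlib.crc32(str(user_id).encode("utf-8")) % 100
    return bucket < rollout_percent
```

Because each user always lands in the same bucket, raising the percentage only adds users to the new service and never moves anyone back to the old one.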

Old code

Admittedly, old code made major contributions in the company's start-up phase. However, as time goes on, business volume grows and complexity increases rapidly, and much of the old code's simple handling gradually turns into a "time bomb" that suddenly goes off and shakes everyone.

Preventive measures: review and clean up old code regularly.

Real cases:

Data Loss

Data security has increasingly become a key concern for enterprises. A SaaS product should ensure that each tenant's data and operations are independent of one another, and that no tenant can see or operate on data it is not authorized for.

Preventive measures: 1. Desensitize sensitive data; 2. Avoid overwriting data; 3. Enforce access control; 4. Guard against XSS security issues.
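
Desensitization and tenant-level access control can each be illustrated with a small helper. In this Python sketch the masking rule (keep the first 3 and last 4 characters) and the names `mask_phone` and `get_order_for_tenant` are assumptions chosen for the example, not a standard.

```python
def mask_phone(phone):
    """Keep the first 3 and last 4 characters, mask the rest with '*'."""
    if len(phone) <= 7:
        return "*" * len(phone)
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]


def get_order_for_tenant(orders_by_id, order_id, tenant_id):
    """Return the order only if it belongs to the requesting tenant."""
    order = orders_by_id.get(order_id)
    if order is None or order["tenant_id"] != tenant_id:
        # Same error for "missing" and "not yours" so existence is not leaked.
        raise PermissionError("order not found for this tenant")
    return order
```

Putting the tenant check inside the data-access function, rather than in each caller, makes it harder to forget.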

Real cases:

Performance issues

With low performance and low throughput, a system hit by a large volume of business in a short time is prone to congestion and delays, which cause failures.

Preventive measures: 1. Replace single calls in a loop with batch calls; 2. Use O(n log n) algorithms in place of slower ones; 3. Use multiple processes or threads; 4. Reduce unnecessary accesses and service dependencies.
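
The first measure, replacing single calls in a loop with one batch call, can be shown with a stand-in for a remote service that counts round trips. `FakeUserStore` and both lookup functions in this Python sketch are hypothetical.

```python
class FakeUserStore:
    """Stand-in for a remote service; counts round trips."""

    def __init__(self, users):
        self.users = users
        self.calls = 0

    def get(self, user_id):
        self.calls += 1
        return self.users.get(user_id)

    def get_many(self, user_ids):
        self.calls += 1
        return {uid: self.users[uid] for uid in user_ids if uid in self.users}


def names_one_by_one(store, user_ids):
    """One round trip per id."""
    return [store.get(uid) for uid in user_ids]


def names_batched(store, user_ids):
    """A single round trip for all ids; missing ids map to None."""
    found = store.get_many(set(user_ids))
    return [found.get(uid) for uid in user_ids]
```

Both functions return the same list, but the batched version makes one call regardless of how many ids are requested.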

Other causes

Equipment and network

Equipment and networks are the infrastructure of the Internet and sit at the bottom layer, so when they have problems the impact is huge. Downtime or hardware faults in aging equipment, sudden network disconnection, or network jitter can easily lead to large-scale failures.

Preventive measures:

  1. Inspect equipment in good time and replace old equipment. Spending more money on replacing old equipment is far more cost-effective than spending time, effort, and compensation money on downtime problems.
  2. Keep backup links and backup machine rooms.
  3. Avoid single points of failure.

Dirty data

Dirty data arises from a lack of overall association constraints. When an application reads dirty data, it is prone to errors; if the application then runs a series of processing logic, it may generate even more dirty data and cause more serious trouble.

Preventive measures:

  1. Detect and eliminate dirty data.
  2. Avoid creating test data in the online environment.
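
Detecting dirty data amounts to checking records against the constraints the application silently relies on. The Python sketch below checks two hypothetical constraints, a known `buyer_id` and a non-negative integer `amount`; a real check would encode whatever association constraints the actual data model requires.

```python
def find_dirty_orders(orders, known_buyer_ids):
    """Return (order_id, reason) pairs for records that violate basic constraints."""
    problems = []
    for order in orders:
        oid = order.get("order_id")
        # Association constraint: every order must point at an existing buyer.
        if order.get("buyer_id") not in known_buyer_ids:
            problems.append((oid, "unknown buyer_id"))
        # Value constraint: amount must be a non-negative integer.
        amount = order.get("amount")
        if not isinstance(amount, int) or amount < 0:
            problems.append((oid, "invalid amount"))
    return problems
```

Run periodically as a scan, a check like this surfaces dirty records before downstream logic builds more bad data on top of them.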


Improper operation

Improper operation mainly covers the following situations:

  1. Two operations are executed concurrently, resulting in an error.
  2. Merge conflicts in code are resolved improperly.
  3. Non-standard operations or operator mistakes trigger runaway processing in the system.

Preventive measures:

  1. When resolving code merge conflicts, have both sides confirm the result.
  2. When several system configuration changes are made at the same time, coordinate their order to avoid concurrency.
  3. Carry out data repair during off-peak business hours.
  4. Review the data repair procedure to make sure it does not introduce new problems.

Troubleshooting

When a failure occurs, the first reaction should not be to investigate the cause right away, but to stop the loss immediately and minimize the impact.

  • If the failure can be traced to a release, roll the release back immediately, and investigate the cause carefully only after the rollback.
  • Report progress promptly so that the parties concerned stay informed.
  • Establish a rapid communication mechanism to keep small problems from growing into major failures.

To reduce the likelihood of failure, failure contingency plans should also be prepared well in advance.

  • Sort out how strong each underlying dependency is, and for each strong dependency determine the impact surface if it becomes unavailable.
  • When a strong dependency is unavailable, have a plan that restores service quickly and minimizes the impact.
  • Failure drills. Simulate high traffic, extreme conditions, and failures to check whether the emergency plan works and recovery is fast.

Summary

Failure is something no developer, and indeed no company, wants to go through. However, every failure exposes some form of negligence, something unknown, or some truth, and viewed positively it is a very valuable learning opportunity. Failure leads people to a deeper grasp of the situation and of the nature of the things involved. Facing failure and drawing real knowledge from it is the better way to prevent and avoid failure.

To prevent failure:

  • First, be careful. Keep the whole picture in mind, assess the impact surface accurately, remember to regression-test old business functions, double-check dependencies, keep agreed return values consistent, and follow standard procedures.
  • Design and implement for robustness, high traffic, and extreme cases, and avoid low performance.
  • Take targeted measures against security problems and capital loss.
  • Set up strict monitoring and alerting to nip problems in the bud.
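
As a rough illustration of the last point, an alert can be driven by the error rate over the most recent calls. The Python sketch below is a simplification; `ErrorRateAlarm`, its window size, and its threshold are hypothetical, and a real setup would normally use the monitoring platform's own alerting rules.

```python
from collections import deque


class ErrorRateAlarm:
    """Fire when the error rate over the last `window` calls reaches `threshold`."""

    def __init__(self, window, threshold):
        self.results = deque(maxlen=window)  # Old results drop off automatically.
        self.threshold = threshold

    def record(self, ok):
        """Record one call outcome; return True if the alarm should fire."""
        self.results.append(bool(ok))
        if len(self.results) < self.results.maxlen:
            return False  # Not enough samples yet to judge.
        errors = self.results.count(False)
        return errors / len(self.results) >= self.threshold
```

Waiting for a full window before firing avoids alarms on the first one or two failed calls after startup, at the cost of a short blind period.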

Origin www.cnblogs.com/lovesqcc/p/11392064.html