Architecture Quality Engineering - Exception Control

Exception discovery

Exceptions should normally be recorded through a dedicated logging library, but many developers habitually print them straight to the console in their code, which is bad practice.
Keep the following points in mind for exception discovery:

1. Do not swallow exceptions.

  1. A try/catch block must not silently ignore an exception. Once you catch one, you should still throw it upward.
  2. Do not use overly general error codes. For example, some developers return a 500 for every error and rely on different message text to tell the cases apart. That makes exceptions hard to distinguish and manage. Use a distinct error code for each kind of exception so that exceptions can be tracked and controlled.

2. Isolate upstream and downstream modules

  1. For example, a trading system will certainly have a shopping cart module and an order module, with an upstream/downstream relationship between them. Exceptions raised in the shopping cart should be digested inside the shopping cart, so that they are blocked from propagating across the module boundary. Inside a module, pass failures as exceptions; between modules, pass them as error codes. If a business exception occurs in the shopping cart, the code may throw it as an exception internally, but what the cart hands to the neighboring module should be a clear error code and message. Passing the raw exception across the boundary may break the other module's processing, and it can also distort the statistics collected on the interface. A unified error code and message between modules lets each side handle the failure in its own business logic and minimizes the impact of errors.

3. Classify exceptions properly

4. Prefer exceptions to null

In some situations we recommend throwing an exception instead of returning null, because a returned null is easily mishandled by the caller and ends up causing another exception somewhere else.

5. Do not print exceptions to the console

Beginners often print exceptions to the console out of habit. Exceptions should instead be written to a log file through the logging library.
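The five points above can be sketched together in a few lines of Java. Everything named here (`CartErrorCode`, `CartException`, `CartResult`, `CartService`, and the `CART-1001`/`CART-1002` codes) is hypothetical and only illustrates the idea: distinct error codes, rethrowing with the cause attached, throwing instead of returning null, logging through a logging library, and returning a code rather than an exception at the module boundary.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical error codes: one distinct code per failure instead of a generic 500.
enum CartErrorCode {
    ITEM_NOT_FOUND("CART-1001"),
    CORRUPT_DATA("CART-1002");

    final String code;

    CartErrorCode(String code) {
        this.code = code;
    }
}

// Exception used only inside the cart module; it always carries an error code.
class CartException extends RuntimeException {
    final CartErrorCode errorCode;

    CartException(CartErrorCode errorCode, String message, Throwable cause) {
        super(message, cause);
        this.errorCode = errorCode;
    }
}

// What crosses the module boundary: a code and a message, never a raw exception.
final class CartResult {
    final String code;      // "OK" on success, otherwise an error code
    final String message;
    final int quantity;

    CartResult(String code, String message, int quantity) {
        this.code = code;
        this.message = message;
        this.quantity = quantity;
    }
}

class CartService {
    private static final Logger LOG = Logger.getLogger(CartService.class.getName());
    private final Map<String, String> storage = new HashMap<>();

    CartService() {
        storage.put("apple", "2");
        storage.put("broken", "two");   // simulates corrupt stored data
    }

    // Inside the module: throw instead of returning null, and never swallow.
    int loadQuantity(String itemId) {
        String raw = storage.get(itemId);
        if (raw == null) {
            throw new CartException(CartErrorCode.ITEM_NOT_FOUND, "no such item: " + itemId, null);
        }
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            // Keep the original cause and rethrow upward with a specific code.
            throw new CartException(CartErrorCode.CORRUPT_DATA, "bad quantity for " + itemId, e);
        }
    }

    // At the module boundary: log through the logging library, return a code.
    CartResult queryForOrderModule(String itemId) {
        try {
            return new CartResult("OK", "", loadQuantity(itemId));
        } catch (CartException e) {
            LOG.log(Level.WARNING, "cart failure " + e.errorCode.code, e);
            return new CartResult(e.errorCode.code, e.getMessage(), 0);
        }
    }
}
```

Note that `java.util.logging` writes to the console by default; in a real project a file handler or the team's own logging library would send these records to a log file.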

Exception control

Exception types

For every exception we need to consider its type, for example: business exceptions, external exceptions, system exceptions, timeout exceptions, and protection exceptions.
These five types are the most common categories we use in exception control.
Business exceptions and system exceptions can be divided further, for example into parameter exceptions or into categories for particular business scenarios. Finer categories make exception handling and alerting more targeted.
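One way to make the five types usable in code is a small classifier that maps a thrown exception to its type. This is only a sketch: the custom exception classes are hypothetical, and treating a rate-limit rejection as a "protection exception" is an assumption about what that term covers.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeoutException;

// The five types named in the text.
enum ExceptionType { BUSINESS, EXTERNAL, SYSTEM, TIMEOUT, PROTECTION }

// Hypothetical marker exceptions; a real project would define its own hierarchy.
class BusinessException extends RuntimeException {
    BusinessException(String message) { super(message); }
}

class ExternalServiceException extends RuntimeException {
    ExternalServiceException(String message) { super(message); }
}

// Assumption: "protection" means the system rejected the call to protect itself.
class RateLimitedException extends RuntimeException {
    RateLimitedException(String message) { super(message); }
}

final class ExceptionClassifier {
    // Anything not recognized is treated as a system exception.
    static ExceptionType classify(Throwable t) {
        if (t instanceof BusinessException) return ExceptionType.BUSINESS;
        if (t instanceof TimeoutException || t instanceof SocketTimeoutException) {
            return ExceptionType.TIMEOUT;
        }
        if (t instanceof RateLimitedException) return ExceptionType.PROTECTION;
        if (t instanceof ExternalServiceException) return ExceptionType.EXTERNAL;
        return ExceptionType.SYSTEM;
    }
}
```

With a mapping like this, alert rules can be attached to a type rather than to individual exception classes.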

Exception grading

Exceptions are also graded into three levels: level 1, level 2, and level 3. Level 1 exceptions need the most attention, level 2 exceptions need a moderate amount, and level 3 exceptions are consulted mainly when investigating a problem or when more information is needed.
Two further points, "restraint" and "link protection", also need attention.
"Restraint" has two aspects. The first is to contain exceptions inside the module, as described above, and to pass only the relevant error codes to the module's upstream and downstream.
The second is to control how often alarms fire. For the same exception, one application raising the alarm is enough. Do not let a large number of applications report the same error at the same time, because that produces a lot of noise.
"Link protection" requires us not only to prevent and control exceptions but also to make reasonable use of them.

Business exception

Business exceptions are usually divided into two categories:

1. General exceptions

These are user behaviors that the system design anticipates.
For example, the order interface checks the user's login status. If the user is not logged in, we throw an exception such as "User not login!". This is a user-behavior scenario that can occur anywhere in the system and that the design already accounts for.

Purchase limits are another example. When a product is limited to two items and the user submits three, the request hits a predefined exception scenario. We can treat this as an over-limit exception and throw a fixed error code for it. When the upstream receives that code, it can convert it into a friendly prompt that tells the user what to do, for example: "Please reduce the three items in the shopping cart to two items." This category covers situations that users may run into during normal operation, where the interaction between the system and the user lets the user get from the error status back to a normal state.

2. Unconventional exceptions

These are exceptions that we did not predict but that occur anyway, caused by behavior that should not happen at all.
For example: "cheating behavior"
Our interface requires the user to pass a verification step on the previous page, such as a verification code, and the user can click the registration button only after obtaining the code. If the user bypasses that verification step and calls the interface directly, we treat it as cheating and throw a business exception. This exception is not caused by normal behavior; it results from the user using some means to get around the flow the system was designed for.
Another example: "behavior beyond authority"
A user calls an interface to delete a product that the user is not allowed to delete. This is an attempt to act beyond the user's authority, and we must return an exception that states clearly that the operation is not permitted. On the page itself, this can be communicated through a more suitable interaction.
Illegal requests of this kind need particular attention when we design unconventional exceptions: we must not leave openings that let malicious users exceed their authority or cheat.
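The two categories can be illustrated with a short Java sketch. The names (`BizCode`, `BizException`, `OrderRules`) and the prompt texts are hypothetical; the limit of two items comes from the example above. The fixed code is what travels upstream, and the upstream turns it into a user-facing prompt.

```java
// Hypothetical fixed codes for the scenarios described above.
enum BizCode {
    NOT_LOGGED_IN,
    PURCHASE_LIMIT_EXCEEDED,
    VERIFICATION_BYPASSED,
    PERMISSION_DENIED
}

class BizException extends RuntimeException {
    final BizCode code;
    // true: anticipated user behavior; false: cheating or acting beyond authority
    final boolean conventional;

    BizException(BizCode code, boolean conventional, String message) {
        super(message);
        this.code = code;
        this.conventional = conventional;
    }
}

final class OrderRules {
    static final int LIMIT_PER_ORDER = 2;

    // General exception: the user simply asked for more than the limit.
    static void checkQuantity(int requested) {
        if (requested > LIMIT_PER_ORDER) {
            throw new BizException(BizCode.PURCHASE_LIMIT_EXCEEDED, true,
                    "requested " + requested + ", limit " + LIMIT_PER_ORDER);
        }
    }

    // Unconventional exception: the caller tries to delete someone else's product.
    static void checkDeletePermission(String callerId, String ownerId) {
        if (!callerId.equals(ownerId)) {
            throw new BizException(BizCode.PERMISSION_DENIED, false,
                    "user " + callerId + " may not delete this product");
        }
    }

    // Upstream side: convert the fixed code into a friendly prompt.
    static String toUserPrompt(BizCode code) {
        switch (code) {
            case NOT_LOGGED_IN:
                return "Please log in first.";
            case PURCHASE_LIMIT_EXCEEDED:
                return "Please reduce the items in the shopping cart to " + LIMIT_PER_ORDER + ".";
            default:
                return "You are not allowed to perform this operation.";
        }
    }
}
```

Keeping the `conventional` flag on the exception is one possible way to let monitoring treat cheating and over-authority attempts differently from ordinary user mistakes.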

System exception

System exceptions are ones we usually do not want to encounter, because a system exception normally reports a problem that has already occurred in the system or one that is about to occur.

Exception problems

These exceptions may come from problems in the system itself, for example bugs in the code that lead to null pointers, out-of-bounds array access, and so on.
Common examples:
NPE (NullPointerException), OOM (OutOfMemoryError), ClassCastException, IllegalArgumentException...
All of these stem from defects in the code or in how components work together. As a result, the system responds to the affected requests abnormally.

Exception frequency

We need to pay attention to how often exceptions occur. Large batches of exceptions, whether continuous or intermittent, mean the system has a definite problem, and we need to intervene quickly to find the cause. If the exceptions occur periodically, work out which cycle they match in order to find the cause.
If they start after a release, check whether the release introduced the problem. If it did, roll back immediately to resolve it quickly.
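Watching frequency usually means counting occurrences inside a time window. The following is a minimal sketch of a sliding-window counter; the class name, the window length, and the threshold are all assumptions chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Counts occurrences of one exception type inside a sliding time window.
final class ExceptionRateWindow {
    private final long windowMillis;
    private final int threshold;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    ExceptionRateWindow(long windowMillis, int threshold) {
        this.windowMillis = windowMillis;
        this.threshold = threshold;
    }

    // Records one occurrence at nowMillis and returns true when the number of
    // occurrences still inside the window has reached the threshold.
    synchronized boolean record(long nowMillis) {
        timestamps.addLast(nowMillis);
        // Drop entries that have fallen out of the window.
        while (!timestamps.isEmpty() && nowMillis - timestamps.peekFirst() >= windowMillis) {
            timestamps.removeFirst();
        }
        return timestamps.size() >= threshold;
    }

    synchronized int currentCount() {
        return timestamps.size();
    }
}
```

For example, `new ExceptionRateWindow(60_000, 100)` would report `true` once the same exception has been seen 100 times within one minute. Comparing the counts before and after a release time is one way to spot a problem the release introduced.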

Timeout exception

A timeout exception usually reflects a system stability problem: a timeout means that the system's response time exceeded the limit we configured. When a timeout occurs is therefore also determined, to some extent, by the timeout value we set.
The shorter the configured timeout, the more easily timeout exceptions occur. When choosing a value we should aim at "final availability of the link", because each module may stay within its own timeout while the link as a whole still times out. Should a single module be given 3 seconds? 5 seconds? 10 seconds? The longer the timeout, the more tolerant a single module is of slowness, but the greater the pressure on the whole link. If five modules are called one after another and each takes 10 seconds, the total wait on the link may reach 50 seconds.
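The arithmetic is easy to check in code. The sketch below first adds up independent per-module timeouts, then shows one common way to keep the link bounded: give the whole link a single deadline and cap each call by whatever time remains. The shared-deadline technique is not described in the text above; it is included as an assumption, and `LinkTimeouts` is a hypothetical helper.

```java
final class LinkTimeouts {
    // Worst-case wait when modules are called one after another and each
    // is allowed to use its full timeout.
    static long worstCaseSequentialMillis(long[] perModuleTimeoutMillis) {
        long total = 0;
        for (long t : perModuleTimeoutMillis) {
            total += t;
        }
        return total;
    }

    // Time left before the link-wide deadline, never negative.
    static long remainingMillis(long deadlineMillis, long nowMillis) {
        return Math.max(0, deadlineMillis - nowMillis);
    }

    // Timeout for the next call: the module's own limit, capped by what the
    // link as a whole still has left.
    static long nextCallTimeoutMillis(long moduleLimitMillis, long deadlineMillis, long nowMillis) {
        return Math.min(moduleLimitMillis, remainingMillis(deadlineMillis, nowMillis));
    }
}
```

With five modules at 10 000 ms each, `worstCaseSequentialMillis` returns 50 000. With a 5 000 ms link deadline and 3 500 ms already spent, `nextCallTimeoutMillis(10_000, 5_000, 3_500)` returns 1 500, so the last call cannot push the link past its deadline.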

When choosing this value, we use "user loss" as the guide. A user may be willing to wait 3 or 5 seconds on a page, but if the page takes 30 seconds, the user will probably close it before the 30 seconds are up. So we should keep our timeouts as short as practical.

In addition, we need to consider "full-link time-consumption expansion", that is, how the time spent grows across the whole link.

Infrastructure failure

When investigating timeout exceptions, the first thing to check is whether the infrastructure is at fault. Some signals point clearly to infrastructure problems, for example network delay. When the machine room's network is weak or disconnected, requests will inevitably take a very long time.

Another signal is Full GC. Once a Full GC starts, a stop-the-world pause occurs.
All request threads then stop and wait for garbage collection to finish, so response times become very long. If Full GCs happen frequently, requests will keep being stalled.
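The JVM exposes garbage collection counters through the standard `java.lang.management` API, which gives a rough way to see how much time is going into GC. The sketch below sums the counters of all collectors; it does not single out Full GC, because the bean names that identify old-generation collectors differ between JVMs and collector settings. `GcSnapshot` is a hypothetical helper.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// A point-in-time reading of the JVM's accumulated GC counters.
final class GcSnapshot {
    final long collections;   // total collections so far, all collectors
    final long totalMillis;   // approximate accumulated GC time, all collectors

    GcSnapshot(long collections, long totalMillis) {
        this.collections = collections;
        this.totalMillis = totalMillis;
    }

    static GcSnapshot take() {
        long count = 0;
        long millis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the value is undefined for a collector.
            long c = gc.getCollectionCount();
            long t = gc.getCollectionTime();
            if (c > 0) count += c;
            if (t > 0) millis += t;
        }
        return new GcSnapshot(count, millis);
    }

    // GC time accumulated between an earlier snapshot and this one.
    long millisSince(GcSnapshot earlier) {
        return totalMillis - earlier.totalMillis;
    }

    long collectionsSince(GcSnapshot earlier) {
        return collections - earlier.collections;
    }
}
```

Taking a snapshot at a fixed interval and comparing it with the previous one shows whether GC time is climbing at the same moments that timeouts appear.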

The third is machine load. The higher the load value, the heavier the load on the machine. Generally speaking, a load of 1 means one CPU is fully occupied. If the load is higher, for example above 5 or even at 10, the machine is carrying the equivalent of up to 10 fully loaded CPUs. In that state individual requests respond very slowly and threads compete fiercely, so thread switching also costs a great deal of time.
In this case, large-scale and dense timeouts occur.
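Load is easier to judge relative to the number of cores. The following sketch reads the system load average through the standard `OperatingSystemMXBean`; the helper name `LoadCheck` and the per-core threshold are assumptions.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

final class LoadCheck {
    // Load divided by core count; values above 1.0 suggest more queued work than cores.
    static double loadPerCore(double loadAverage, int cores) {
        return loadAverage / cores;
    }

    // A negative load average means "not available" and is never treated as overload.
    static boolean overloaded(double loadAverage, int cores, double maxPerCore) {
        return loadAverage >= 0 && loadPerCore(loadAverage, cores) > maxPerCore;
    }

    static boolean currentlyOverloaded(double maxPerCore) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // getSystemLoadAverage() returns a negative value on platforms that do not provide it.
        return overloaded(os.getSystemLoadAverage(), os.getAvailableProcessors(), maxPerCore);
    }
}
```

A load of 10 on a 4-core machine gives 2.5 per core and counts as overloaded with a threshold of 1.0, while the same load on a 16-core machine does not.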

Middleware problem

In a high-concurrency architecture, caching is a key optimization point because it greatly reduces the request time of our system. But once Redis has a problem, the ability to return quickly from the cache disappears. All queries may then fall through to the database, and query time grows.
Likewise, if we use MySQL and MySQL itself has performance problems, or its CPU or memory usage is too high, it responds to SQL very slowly. A full table scan over a large amount of data also occupies much of the MySQL buffer, and once the buffer is full MySQL has to read and write the disk, which makes responses very slow.
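The effect of a cache outage can be seen in a small cache-aside sketch. No real Redis or MySQL client is used here: `SimpleCache`, `MapCache`, and `CachedReader` are hypothetical stand-ins, and the "database" is just a function. Cache failures are counted rather than silently dropped, so they can still be reported.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Minimal cache abstraction; a Redis client would sit behind it (assumption).
interface SimpleCache {
    String get(String key);
    void put(String key, String value);
}

// In-memory stand-in that can be switched to "down" to simulate an outage.
final class MapCache implements SimpleCache {
    private final Map<String, String> data = new HashMap<>();
    boolean down = false;

    public String get(String key) {
        if (down) throw new IllegalStateException("cache unavailable");
        return data.get(key);
    }

    public void put(String key, String value) {
        if (down) throw new IllegalStateException("cache unavailable");
        data.put(key, value);
    }
}

final class CachedReader {
    private final SimpleCache cache;
    private final Function<String, String> database;
    int databaseReads = 0;
    int cacheFailures = 0;

    CachedReader(SimpleCache cache, Function<String, String> database) {
        this.cache = cache;
        this.database = database;
    }

    String read(String key) {
        try {
            String cached = cache.get(key);
            if (cached != null) return cached;
        } catch (RuntimeException e) {
            cacheFailures++;          // cache is down: count it and fall through
        }
        databaseReads++;              // every miss or cache failure hits the database
        String value = database.apply(key);
        try {
            cache.put(key, value);
        } catch (RuntimeException e) {
            cacheFailures++;
        }
        return value;
    }
}
```

While the cache is healthy, repeated reads of the same key reach the database once. As soon as the cache is marked down, every read reaches the database, which is exactly the load shift described above.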

Program performance issues

The most common one is "slow SQL". Designing indexes, optimizing query conditions, splitting large tables into small ones, and splitting large queries into small ones can all improve the execution efficiency of our SQL.
Slow SQL drags down MySQL itself, makes each query take much longer, and soon exhausts the connection pool. Once the pool is exhausted, requests start to wait, which drags overall performance down further.
The second is slow code. If a piece of code executes very slowly, we can rewrite and optimize it to improve the system's response time and avoid timeout exceptions.
The third is thread contention. Under heavy traffic, threads switch at a high rate, and the extra cost caused by contention drags the system down.
All three deserve our attention.

Exception classification

  1. Level 1: exceptions that can cause failures (must be watched closely)
  2. Level 2: exceptions that may indicate an unhealthy system, or occasional odd exceptions (watch the statistics)
  3. Level 3: routine business exceptions or other exceptions that will not get worse (watch by sampling)
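The three levels map naturally onto three handling policies. The sketch below is one possible reading of "must watch", "watch the statistics", and "watch by sampling": alert immediately, count only, and keep one in every N events. The names and the sampling scheme are assumptions.

```java
import java.util.concurrent.atomic.AtomicLong;

enum ExceptionLevel { LEVEL_1, LEVEL_2, LEVEL_3 }

enum Action { ALERT_NOW, COUNT_ONLY, SAMPLE, SKIP }

final class LevelPolicy {
    private final int sampleEvery;
    private final AtomicLong level3Seen = new AtomicLong();

    // sampleEvery must be at least 1: keep one level 3 event out of every sampleEvery.
    LevelPolicy(int sampleEvery) {
        if (sampleEvery < 1) {
            throw new IllegalArgumentException("sampleEvery must be >= 1");
        }
        this.sampleEvery = sampleEvery;
    }

    Action decide(ExceptionLevel level) {
        switch (level) {
            case LEVEL_1:
                return Action.ALERT_NOW;     // can cause failures: always alert
            case LEVEL_2:
                return Action.COUNT_ONLY;    // feed the statistics, no direct alert
            default:
                // Level 3: keep the first of every sampleEvery events, skip the rest.
                return level3Seen.getAndIncrement() % sampleEvery == 0
                        ? Action.SAMPLE
                        : Action.SKIP;
        }
    }
}
```

With `new LevelPolicy(3)`, four consecutive level 3 events produce SAMPLE, SKIP, SKIP, SAMPLE.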

Exception discoverability (reach)

Alarm channels:

  1. Group message
  2. Email
  3. SMS
  4. Phone call

Exception convergence (anti-flooding)

  1. Aggregate alarms for exceptions of the same type that occur at the same time
  2. Minimal reach: send targeted alarms only to the people who need them
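Aggregation can be as simple as sending the first alarm of a type and suppressing repeats until a window has passed. This is a minimal sketch under that assumption; `AlarmAggregator` and the window length are hypothetical, and the suppressed count is kept so that it can be included in the next alarm.

```java
import java.util.HashMap;
import java.util.Map;

// Sends at most one alarm per exception type per time window.
final class AlarmAggregator {
    private final long windowMillis;
    private final Map<String, Long> windowStart = new HashMap<>();
    private final Map<String, Integer> suppressed = new HashMap<>();

    AlarmAggregator(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Returns true if an alarm for this exception type should be sent now.
    synchronized boolean shouldSend(String exceptionType, long nowMillis) {
        Long start = windowStart.get(exceptionType);
        if (start == null || nowMillis - start >= windowMillis) {
            // First occurrence, or the previous window has ended: open a new one.
            windowStart.put(exceptionType, nowMillis);
            suppressed.put(exceptionType, 0);
            return true;
        }
        suppressed.merge(exceptionType, 1, Integer::sum);
        return false;
    }

    // Number of alarms of this type held back in the current window.
    synchronized int suppressedCount(String exceptionType) {
        return suppressed.getOrDefault(exceptionType, 0);
    }
}
```

Keying the maps on the exception type plus the owning team would be one way to add the targeted delivery mentioned in the second point.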

Exception filtering (anti-fatigue)

Filter suspicious alarms to avoid interference
Control alarm effectiveness to prevent fatigue


Origin blog.csdn.net/qq_45455361/article/details/127050862