How to write error logs that make troubleshooting easier


The main purpose of writing error logs in a program is to provide important clues and guidance for troubleshooting and solving problems. In practice, however, error logs vary widely in content and format: the error information may be incomplete, lack context, or have unclear meaning, which makes troubleshooting inconvenient or time-consuming. In fact, a little extra care while programming saves a great deal of wasted troubleshooting effort. Before explaining how to write effective error logs, it is important to understand how errors arise.

How errors arise

For the current system, errors are introduced in three places:

1. Illegal parameters passed in from the upper-layer system. Errors introduced by illegal parameters can be intercepted by checking the parameters and preconditions.
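
A minimal Java sketch of intercepting illegal parameters at the entry point. The class name, the resizeDisk operation, and the size limits are hypothetical and serve only as illustration. The message states which parameter was illegal, what value was received, and what range is allowed.

```java
/** Hypothetical entry-point check; names and limits are illustrative assumptions. */
final class DiskRequestValidator {
    static final int MIN_SIZE_GB = 1;
    static final int MAX_SIZE_GB = 2048;

    private DiskRequestValidator() {}

    /** Rejects illegal parameters before they reach the lower layers. */
    static void checkResizeDisk(String vmName, int sizeGb) {
        if (vmName == null || vmName.trim().isEmpty()) {
            throw new IllegalArgumentException(
                "[resizeDisk] vmName must not be empty, but was: " + vmName);
        }
        if (sizeGb < MIN_SIZE_GB || sizeGb > MAX_SIZE_GB) {
            throw new IllegalArgumentException(
                "[resizeDisk] sizeGb for vm '" + vmName + "' must be between "
                + MIN_SIZE_GB + " and " + MAX_SIZE_GB + ", but was: " + sizeGb);
        }
    }
}
```

Calling checkResizeDisk("vm-001", 0) fails immediately with a message that names the operation, the entity, the allowed range, and the offending value.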

2. Errors arising from interaction with the lower-layer system. There are two kinds:

    a. The lower-layer system processes the request successfully, but communication fails. This leads to data inconsistency between subsystems. In this case, a timeout-compensation mechanism may be used: record the task in advance, and have a later scheduled task correct the data. Is there a better design?

    b. Communication succeeds, but the lower-layer system fails to process the request. In this case, talk to the developers of the lower-layer system to coordinate the interaction between subsystems, and either handle the error appropriately or give a reasonable error message based on the error code and description that the lower layer returns.

In either case, assume that the lower-layer system is only moderately reliable, and design for errors accordingly.

3. Errors in this layer's own processing.

Errors arise in this layer for the following reasons:

Reason 1: negligence

Negligence means that the programmer was fully capable of avoiding the error but did not. Examples: typing & instead of &&, or = instead of ==; boundary errors; errors in compound logical conditions. Negligence happens either because the programmer lacks concentration, for example writing code while fatigued, distracted, or after working overtime all night, or because the programmer rushes to implement the feature without considering the program's robustness.

Improvement: static code analysis tools and unit tests with line coverage can effectively avoid such problems.

Reason 2: insufficient attention to error and exception handling

Take input handling as an example. When adding two numbers, consider not only arithmetic overflow but also illegal input. The former can be avoided through understanding, experience, or lessons from past mistakes. For the latter, the input must be restricted to a range that we can control, for example by using regular expressions to filter out invalid input. The regular expression itself must be tested. For illegal input, give a message that is as detailed, understandable, and friendly as possible, stating the cause and a recommended solution.
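
The addition example can be sketched in Java as follows. The class name SafeAdder and the accepted input format (an optional minus sign followed by up to ten digits) are assumptions made for illustration. Math.addExact from the JDK is used to detect overflow.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: adds two integers given as text, guarding input and overflow. */
final class SafeAdder {
    // Optional minus sign followed by 1 to 10 digits; a deliberately narrow, assumed format.
    private static final Pattern INTEGER = Pattern.compile("-?\\d{1,10}");

    private SafeAdder() {}

    static int add(String left, String right) {
        int a = parse("left", left);
        int b = parse("right", right);
        try {
            return Math.addExact(a, b); // throws ArithmeticException on int overflow
        } catch (ArithmeticException e) {
            throw new IllegalArgumentException(
                "Cannot add " + a + " and " + b + ": the sum is outside the int range ["
                + Integer.MIN_VALUE + ", " + Integer.MAX_VALUE + "]; use smaller operands", e);
        }
    }

    private static int parse(String name, String text) {
        if (text == null || !INTEGER.matcher(text).matches()) {
            throw new IllegalArgumentException(
                "Operand '" + name + "' must be a decimal integer such as 42 or -7, but was: " + text);
        }
        try {
            return Integer.parseInt(text);
        } catch (NumberFormatException e) {
            // A 10-digit value can still exceed the int range.
            throw new IllegalArgumentException(
                "Operand '" + name + "' is outside the int range: " + text, e);
        }
    }
}
```

Each failure path says what was wrong with which operand and what an acceptable value looks like, so the log entry can be understood without reading the code.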

Improvement: consider error conditions and exceptions as thoroughly as possible. After implementing the main flow, add one more step: carefully examine the possible errors and exceptions, and return a reasonable error code and error description for each. If every interface and module handles its own errors and exceptions properly, bugs caused by complex interaction scenarios can be effectively avoided. For example, suppose a business use case is carried out through the interaction of A, B, and C. A and B succeed, but C fails. B must then roll back according to the code and message returned by C and return a reasonable code and message to A; A rolls back according to what B returns and returns a reasonable code and message to the client. This is a segmented rollback mechanism, and in every scenario the possibility that a rollback itself fails must be considered.
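
A simplified Java sketch of segmented rollback. It assumes a single coordinator that runs the steps in order, which is a simplification of the chained A -> B -> C calls described above; the Step interface and all names are hypothetical. Completed steps are rolled back in reverse order, and a failed rollback is reported instead of being swallowed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical step of a multi-part business operation. */
interface Step {
    String name();
    void execute() throws Exception;
    void rollback() throws Exception;
}

final class SegmentedRollback {
    private SegmentedRollback() {}

    /** Runs the steps in order; returns "OK" or a message describing the failure and each rollback. */
    static String run(List<Step> steps) {
        List<Step> done = new ArrayList<>();
        for (Step step : steps) {
            try {
                step.execute();
                done.add(step);
            } catch (Exception e) {
                StringBuilder msg = new StringBuilder(
                    "FAILED at step " + step.name() + ": " + e.getMessage());
                // Undo the completed steps, most recent first.
                for (int i = done.size() - 1; i >= 0; i--) {
                    Step completed = done.get(i);
                    try {
                        completed.rollback();
                        msg.append("; rolled back ").append(completed.name());
                    } catch (Exception rollbackError) {
                        msg.append("; ROLLBACK FAILED for ").append(completed.name())
                           .append(": ").append(rollbackError.getMessage());
                    }
                }
                return msg.toString();
            }
        }
        return "OK";
    }
}
```

The returned message records which step failed, why, and the outcome of every rollback, which is the information needed later to repair inconsistent data.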

Reason 3: tightly coupled logic

Because the business logic is tightly coupled, logical relationships become intricate as the software product grows step by step. It becomes hard to see the whole picture, so a local modification spreads its effects globally and causes unpredictable problems.

Improvement: write short functions and methods, preferably no more than 50 lines each. Write stateless functions and methods that only read global state and always produce the same output for the same input, rather than changing their behavior depending on external state. Define data structures, interfaces, and logic segments sensibly, so that interactions between interfaces are as orthogonal and loosely coupled as possible. The service layer should offer interfaces that are as simple and orthogonal as possible. Refactor continuously to keep the application modular and loosely coupled, and sort out the logical dependencies. Where many service interfaces affect each other, map out the business logic flow and the interdependencies among the interfaces and optimize them as a whole. Where entities have many states, map out the state transitions performed by the relevant service interfaces and organize the state relationships.

Reason 4: incorrect algorithms

Improvement: first, separate the algorithm from the application. If the algorithm has multiple implementations, errors can be found by cross-validating them in unit tests, as with sorting. If the algorithm is reversible, errors can be found by checking the round trip in unit tests, as with encryption and decryption.
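
Both checks can be sketched in Java as follows. The insertion sort is a stand-in for the implementation under test, and Base64 encoding stands in for encryption and decryption so that the example needs only the JDK; both substitutions are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;
import java.util.Random;

final class AlgorithmChecks {
    private AlgorithmChecks() {}

    /** Implementation under test: a simple insertion sort that returns a sorted copy. */
    static int[] insertionSort(int[] input) {
        int[] a = input.clone();
        for (int i = 1; i < a.length; i++) {
            int key = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
        return a;
    }

    /** Cross-validation: compare against the JDK sort on random inputs. */
    static boolean agreesWithJdkSort(int rounds, long seed) {
        Random random = new Random(seed);
        for (int r = 0; r < rounds; r++) {
            int[] data = random.ints(random.nextInt(50), -100, 100).toArray();
            int[] expected = data.clone();
            Arrays.sort(expected);
            if (!Arrays.equals(expected, insertionSort(data))) {
                return false;
            }
        }
        return true;
    }

    /** Reversibility check: decoding the encoded bytes must give back the original. */
    static boolean roundTrips(String text) {
        byte[] original = text.getBytes(StandardCharsets.UTF_8);
        String encoded = Base64.getEncoder().encodeToString(original);
        return Arrays.equals(original, Base64.getDecoder().decode(encoded));
    }
}
```

A unit test would assert that agreesWithJdkSort and roundTrips return true; a fixed seed keeps any failure reproducible.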

Reason 5: same-typed parameters passed in the wrong order

For example, the method is declared as modifyFlow(int rx, int tx), but the actual call is modifyFlow(tx, rx).

Improvement: make types as specific as possible: use floating-point types for floating-point numbers, strings for strings, and a dedicated object type for each particular kind of object. Place parameters of the same type apart from each other rather than side by side. If none of this is possible, the interface must be verified by tests, and the test values for the parameters MUST differ from one another.
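
One way to make the types more specific, sketched in Java. The wrapper types RxRate and TxRate and the class FlowService are hypothetical names introduced for this example; only modifyFlow comes from the text above.

```java
/** Hypothetical wrapper for the receive-side value. */
final class RxRate {
    final int value;
    RxRate(int value) { this.value = value; }
}

/** Hypothetical wrapper for the transmit-side value. */
final class TxRate {
    final int value;
    TxRate(int value) { this.value = value; }
}

final class FlowService {
    private FlowService() {}

    /** With distinct parameter types, a swapped call modifyFlow(tx, rx) no longer compiles. */
    static String modifyFlow(RxRate rx, TxRate tx) {
        return "rx=" + rx.value + ", tx=" + tx.value;
    }
}
```

If the int parameters have to stay, a test should call the method with different values such as modifyFlow(10, 20), because a test that passes the same value twice cannot detect a swap.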

Reason 6: null pointer exceptions

A null pointer exception usually occurs because an object was not properly initialized, or because the object was not checked for null before use.

Improvement: for configuration objects, check that they were initialized successfully; for ordinary entity objects, check that the object obtained is non-null before using it.
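
A small Java sketch of both checks, using Objects.requireNonNull from the JDK. The VmRepository class and its in-memory map are hypothetical stand-ins for a configuration object and an entity lookup.

```java
import java.util.Map;
import java.util.Objects;

/** Hypothetical repository backed by an in-memory map, for illustration only. */
final class VmRepository {
    private final Map<String, String> vmStateByName;

    VmRepository(Map<String, String> vmStateByName) {
        // Fail at construction time if the backing data was never initialized.
        this.vmStateByName = Objects.requireNonNull(vmStateByName,
            "vmStateByName was not initialized; check that the VM inventory was loaded at startup");
    }

    String stateOf(String vmName) {
        String state = vmStateByName.get(vmName);
        if (state == null) {
            // Report the missing entity by name instead of failing later with a bare NullPointerException.
            throw new IllegalStateException(
                "[stateOf] no VM record found for vmName='" + vmName
                + "'; verify the name or create the VM first");
        }
        return state;
    }
}
```

Checking at the point where the object is obtained keeps the error close to its cause, and the message names the entity that was missing.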

Reason 7: network communication errors

Network communication errors are usually caused by network delay, congestion, or outages. They are usually low-probability events, but a low-probability event can easily cause a large-scale failure and a bug that is hard to reproduce.

Improvement: log at INFO level at the exit point of the upstream subsystem and at the entry point of the downstream subsystem. The time difference between the two log entries provides a clue.
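
A minimal Java sketch of the paired log lines and the time difference between them. The line format, the point labels, and the class name HandoffLog are assumptions; a real system would emit these lines through its logging framework.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for the paired INFO lines around a subsystem handoff. */
final class HandoffLog {
    private HandoffLog() {}

    /** Builds one INFO line; point is, for example, "upstream-exit" or "downstream-entry". */
    static String line(Instant at, String point, String operation, String vmName) {
        return at + " INFO [" + point + "] operation=" + operation + " vmName=" + vmName;
    }

    /** Time the request spent between the two subsystems, in milliseconds. */
    static long transitMillis(Instant sentAt, Instant receivedAt) {
        return Duration.between(sentAt, receivedAt).toMillis();
    }
}
```

The difference is only meaningful if the clocks of the two machines are synchronized; with that in place, an unusually large transit time points to the network rather than to either subsystem.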

Reason 8: transaction and concurrency errors

When transactions are combined with concurrency, errors occur easily and are very hard to locate.

Improvement: for concurrent operations, log at INFO level wherever the program touches shared variables or makes important state changes. Is there a more effective approach?

Reason 9: configuration errors

Improvement: when starting the application or enabling a configuration, check all configuration items and print the corresponding INFO logs to make sure that every configuration item has been loaded successfully.
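
A minimal Java sketch of a startup configuration check built on java.util.Properties. The class name ConfigChecker and the log line format are assumptions; the caller supplies the required keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical startup check that produces one log line per required configuration item. */
final class ConfigChecker {
    private ConfigChecker() {}

    static List<String> check(Properties config, String... requiredKeys) {
        List<String> lines = new ArrayList<>();
        for (String key : requiredKeys) {
            String value = config.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                lines.add("ERROR config item '" + key
                    + "' is missing or empty; set it before starting the application");
            } else {
                // Real code should mask secrets such as passwords before logging them.
                lines.add("INFO config item '" + key + "' loaded: " + value);
            }
        }
        return lines;
    }

    /** True only if no required item was reported as missing. */
    static boolean allLoaded(List<String> lines) {
        for (String line : lines) {
            if (line.startsWith("ERROR")) {
                return false;
            }
        }
        return true;
    }
}
```

An application could refuse to start when allLoaded returns false, so that a missing item surfaces at startup instead of during a later request.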

Reason 10: errors caused by unfamiliarity with the business

In large systems, some of the business logic and business interactions are quite complex. The complete business logic may exist only in the heads of several developers, each of whom understands only part of it. This easily leads to coding errors in the business logic.

Improvement: design correct business use cases through discussion and communication among several people, and write the business logic according to those use cases. The final business logic and business use cases must be fully documented and archived. In each business interface, state the preconditions, processing logic, postcondition checks, and caveats. When the business changes, update the business comments in step. Do code review. Business comments are an important part of the documentation of a business interface and act as a valuable cache of business understanding.

Reason 11: problems caused by design

For example, a synchronous, serial design suffers from poor performance and slow response. An asynchronous, concurrent design solves the performance and response problems but brings risks to safety and correctness. Going asynchronous also changes the programming model, adding things such as pushing and receiving asynchronous messages. Caching can improve performance, but it introduces the problem of cache updates.

Improvement: write design documents and review them carefully. A design document must set out the background, the requirements, the business goals to be met, the business performance metrics to be achieved, the possible impact, the overall design approach, the detailed solution, and the foreseeable advantages, disadvantages, and effects of the solution. Use testing and acceptance to make sure that the changed design actually meets the business goals and performance metrics.

Reason 12: errors caused by unknown details

Examples include buffer overflows and SQL injection attacks. The code works from a functional point of view but is vulnerable from the point of view of malicious use. Another example: when the Jackson library is used to parse JSON strings, by default a new field that the target object does not declare causes a parsing error. The @JsonIgnoreProperties(ignoreUnknown = true) annotation must be added to the object so that it copes with such changes. Some other JSON libraries ignore unknown fields by default and do not have this problem.

Improvement: on the one hand, build up experience; on the other hand, think about security issues and exceptional cases, and choose mature, rigorously tested libraries.

Reason 13: bugs that emerge over time

It is common for a solution that looked very good in the past to become awkward or even useless in current or future scenarios. For example, an encryption algorithm once considered perfect must be used with caution after it has been broken.

Improvement: follow news of changes and bug fixes, and promptly correct outdated code, libraries, and behavior.

Reason 14: hardware-related errors

Examples include memory leaks, insufficient storage space, and OutOfMemoryError.

Improvement: add monitoring of the application's key system metrics: CPU, memory, and network.

Common errors in the system:

  1. An entity record does not exist in the database: the message must state which entity, or give the entity identifier;
  2. An entity is not configured correctly: the message must state which configuration item is wrong and what the correct configuration should be;
  3. An entity's resources do not meet the requirement: the message must state what the current resources are and what resources are required;
  4. The preconditions of an entity operation are not met: the message must state which preconditions must be met and what the current state is;
  5. The postcondition check of an entity operation fails: the message must state which postconditions must be satisfied and what the current state is;
  6. A performance problem causes a timeout: the message must state what caused the performance problem and how to optimize it afterwards;
  7. A communication error between interacting subsystems leaves state or data inconsistent.

Errors that are hard to locate generally occur at relatively low levels. Because the lower layer cannot foresee the specific business scenario, the error messages it gives are rather generic.

This requires the upper business layer to provide clues that are as rich as possible. An error is always caused by some layer of the stack, or some interaction between systems, failing to meet a precondition. When programming, make sure at each layer of the stack that all necessary preconditions are met, avoid passing wrong parameters down to the lower layers as far as possible, and intercept errors in the business layer as far as possible.

Most errors result from a combination of causes, but every error has a cause. After resolving an error, analyze in depth how it occurred and how to keep it from happening again. Effort can bring success, but only reflection brings progress!

How to write error logs that make troubleshooting easier

Basic principles for writing error logs:

  1. As complete as possible. Each error log entry should fully describe: in what scenario what error occurred, what the cause is (or what the possible causes are), and how to solve it (or hints for solving it);
  2. As specific as possible. For example, for "NC has insufficient resources", state exactly which resource is lacking; the program can indicate this directly. For generic errors such as VM NOT EXIST, state the scenario in which the error occurred, which may also help later statistics.
  3. As direct as possible. Ideally, a person reading the error log should know at first glance what caused the error and how to solve it, without going through several steps to find the real cause.
  4. Build existing experience directly into the system. Every problem already solved and every lesson already learned should be built into the system in as friendly a way as possible, giving newcomers better hints, rather than being buried elsewhere.
  5. Keep the layout clean and orderly and the format uniform and standardized. Dense, essay-style logs are daunting to read, unfriendly, and hard to troubleshoot with.
  6. Identify a request uniquely using several keywords, and highlight these keywords: time, entity identifier (such as VmName), and operation name.
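
The principles above can be sketched as a small Java formatter. The class name, the field order, and the separators are assumptions chosen for illustration; the point is that every entry carries the keywords, the scenario, the error, the cause, and a hint in one uniform layout.

```java
/** Hypothetical formatter that gives every error log entry the same layout. */
final class ErrorLogMessage {
    private ErrorLogMessage() {}

    static String format(String time, String operation, String entityId,
                         String scenario, String error, String cause, String hint) {
        return time + " [" + operation + "] [" + entityId + "] "
            + "when " + scenario + ": " + error
            + " | cause: " + cause
            + " | hint: " + hint;
    }
}
```

For example, format("2024-01-01 10:00:00", "startVm", "vm-001", "allocating memory on the NC", "NC has insufficient resources", "8 GB required, 2 GB available", "free memory or choose another NC") yields a single line that can be found by time, operation, or entity and understood without the code.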

Basic steps for troubleshooting

Log in to the application server -> open the log file -> locate the error log entry -> troubleshoot according to the clues in the error log, then confirm and solve the problem.

Specifically:

  1. From logging in to opening the log file: since there is more than one application server, logging in to each one in turn is inconvenient. A tool on AG is needed to view the logs of all servers directly, or even to filter out the required error logs directly.
  2. Locating the error log entry. The logs are currently laid out densely, which makes the error log entry hard to find. Generally, first use the time to get close to the error log entry, then use a combination of entity keyword and operation name to pin it down. Locating error logs by requestId is more traditional, but the requestId has to be found first and is not descriptive. It is best to locate the error log entry directly by time and content keywords.
  3. Analyzing the error log. The content of the error log should be as straightforward as possible, clearly match the symptoms of the problem under investigation, and give important clues.
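
Locating entries by time and content keywords can be sketched in Java as below. It assumes that each log line starts with a timestamp and contains its level surrounded by spaces, for example "2024-01-01 10:03:15 ERROR ..."; that line format and the class name are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical filter: ERROR lines within a time prefix that contain every keyword. */
final class ErrorLogFilter {
    private ErrorLogFilter() {}

    static List<String> find(List<String> lines, String timePrefix, String... keywords) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (!line.startsWith(timePrefix) || !line.contains(" ERROR ")) {
                continue;
            }
            boolean containsAll = true;
            for (String keyword : keywords) {
                if (!line.contains(keyword)) {
                    containsAll = false;
                    break;
                }
            }
            if (containsAll) {
                hits.add(line);
            }
        }
        return hits;
    }
}
```

A call such as find(lines, "2024-01-01 10:0", "vm-001", "startVm") narrows the search to a ten-minute window and one entity and operation, with no requestId lookup needed first.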

The usual problem with application error logs is that their content is written to be understood in the context of the current code. It looks concise but is always incomplete. Once away from the code, it is hard to know exactly what the log means, and people have to stop and think or go and read the code to understand it. Isn't this making trouble for ourselves?

An error log should describe clearly what happened even when it is read without the code context.

In addition, if the cause can be explained clearly in the error log itself, inspecting the logs takes much less effort.

In a sense, the error log can also serve as a very useful document that records use cases of all kinds of illegal operations.

At present, error log content may have the following problems:

1. The error log does not state the error parameters and content

Solution: this usually requires writing a readable toString method on the DO object concerned.
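
A minimal Java sketch of such a toString method. The VmDO class and its fields are hypothetical.

```java
/** Hypothetical data object with a readable toString for use in log messages. */
final class VmDO {
    private final String vmName;
    private final int cpuCores;
    private final int memoryMb;

    VmDO(String vmName, int cpuCores, int memoryMb) {
        this.vmName = vmName;
        this.cpuCores = cpuCores;
        this.memoryMb = memoryMb;
    }

    @Override
    public String toString() {
        return "VmDO{vmName='" + vmName + "', cpuCores=" + cpuCores
            + ", memoryMb=" + memoryMb + "}";
    }
}
```

Without the override, concatenating the object into a log message prints only the class name and a hash code, which says nothing about the parameters that caused the error.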

2. The error scenario is not clear

Solution: prefix the error message with a "when ..." clause, or with [interface name], so that the error scenario can be understood directly from the error log. Generally, add [interface name] where the executing interface is known, and add a "when ..." clause for the business scenario.

3. The content is unclear or ambiguous

Solution: describe the error content more clearly and aptly.

4. The troubleshooting guidance is unclear

Solution: add the appropriate background knowledge and guiding troubleshooting steps.

5. The error content is not specific or detailed enough

Solution: use the program, or better techniques, to reveal the specific location of the difference as far as possible and reduce manual comparison.
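
Letting the program reveal the specific difference can be sketched in Java as follows. The StateDiff class and the idea of comparing two string maps are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

/** Hypothetical helper that reports only the entries that differ. */
final class StateDiff {
    private StateDiff() {}

    static List<String> diff(Map<String, String> expected, Map<String, String> actual) {
        // Visit the union of keys in sorted order so the output is stable.
        TreeSet<String> keys = new TreeSet<>(expected.keySet());
        keys.addAll(actual.keySet());
        List<String> differences = new ArrayList<>();
        for (String key : keys) {
            String expectedValue = expected.get(key);
            String actualValue = actual.get(key);
            if (!Objects.equals(expectedValue, actualValue)) {
                differences.add(key + ": expected=" + expectedValue + ", actual=" + actualValue);
            }
        }
        return differences;
    }
}
```

Logging the result of diff instead of both complete objects points the reader straight at the mismatching field.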

Conclusion

The error log is an important means of troubleshooting. When programming a feature, we usually consider the various errors that may occur and their causes.

To trace a symptom back to its cause, we need key descriptions that locate the cause. This forms a triple: symptom -> key error description -> ultimate cause of the error.

So when programming, we need to provide a corresponding key description for each error, stating as completely, specifically, and directly as possible in what scenario what error occurred, what caused it, and what measures or steps to take.

 

Source: www.liangsonghua.me

Author: Liang Songhua, senior engineer at Jingdong, with in-depth knowledge of stability assurance, agile development, advanced Java, and microservice architecture

Origin www.cnblogs.com/liangsonghua/p/www_liangsonghua_me_31.html