Let's Talk About Logging Best Practices

1. Background

Logging is something we programmers deal with constantly, and you probably hear the word every day. I remember joining my first company straight out of university, at a time when our department was taking over services from other departments for maintenance. Not long after the handover, the WeChat official account features of one business stopped working. The colleagues responsible for that business had a very hard time troubleshooting it: apart from the error log at the entry point, the entire call chain did not log a single line. Only after several rounds of adding logs and redeploying was the problem finally located. Around the same time, another department had the opposite problem: it logged far too much. Given the volume of business traffic, the flood of log output saturated disk I/O and eventually brought down the entire service.

Years have passed in a flash, but the same problems keep playing out in different companies and different departments. Many developers neglect logging for their own convenience. That may be fine while you maintain the code yourself, because you wrote it and can often spot the problem at a glance, but once your code is handed over to someone else, the missing logs become their problem. Some readers of my official account have also asked me questions about logging, so, combining my own experience with the "Alibaba Java Development Manual", I will write down what I consider to be the best practices for logging.

2. Best Practices

2.1 Use log levels sensibly

Logging frameworks typically define 6 levels to control log output:

  • TRACE: The finest-grained tracing output. It is rarely used in practice and adds little value.
  • DEBUG: Debugging logs. If a piece of information is not very important and is only needed in rare edge cases, log it at DEBUG.
  • INFO: Routine output that is logged frequently in a consistent format. We often rely on this information to troubleshoot business problems, so use INFO for it.
  • WARN: Warnings, usually for known business errors that can be handled.
  • ERROR: Errors, usually for exceptions or failures that we cannot handle.
  • FATAL: Fatal errors, meaning the program must terminate immediately. This is rarely used, and we do not use it in business code (SLF4J and Logback do not even define it).

Although there are 6 levels, in real business development we generally only need to care about 3 of them. TRACE and FATAL are basically never used. DEBUG is used more in foundational libraries and tools: these print logs implicitly on behalf of the business side, so information that is not very important should go to DEBUG.

Many developers mix up ERROR and WARN. ERROR logs usually trigger alerts, such as SMS messages. Some developers log every failure as ERROR, including cases like a user lacking permission or having an insufficient balance, and the inevitable result is a flood of alert messages. These failures are really part of the normal business flow, so WARN is enough for them. Here is a good way to handle this: make all business exceptions extend a common base exception (in our business it is BizException), catch that base type, and log it at WARN:

        try{
            // do something
        }catch (BizException e){
            LOG.warn("biz exception", e);
        }catch (Exception e){
            LOG.error("exception", e);
        }

Of course, this logic can also be moved into an aspect: put as much context as possible into the exception itself and handle all exceptions in one place.
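To make the BizException approach concrete, here is a minimal, framework-free sketch. BizException's two subclasses, ExceptionLevels and its Level enum are hypothetical names, and the logging call is abstracted as a callback rather than a real SLF4J logger. In a Spring application the same dispatch would more likely live in an @Around advice or a global exception handler.

```java
import java.util.function.BiConsumer;

// Hypothetical base class for expected business failures.
class BizException extends RuntimeException {
    BizException(String message) { super(message); }
}

// Hypothetical business exceptions matching the examples in the text.
class NoPermissionException extends BizException {
    NoPermissionException(String user) { super("no permission: " + user); }
}

class InsufficientBalanceException extends BizException {
    InsufficientBalanceException(long needed) { super("insufficient balance, needed " + needed); }
}

final class ExceptionLevels {
    enum Level { WARN, ERROR }

    // Business exceptions are part of the normal flow, so WARN; anything else is ERROR (and alerts).
    static Level levelFor(Throwable t) {
        return (t instanceof BizException) ? Level.WARN : Level.ERROR;
    }

    // One central place to run business code and log failures by category, the way an aspect would.
    // The sink stands in for LOG.warn / LOG.error.
    static void run(Runnable action, BiConsumer<Level, Throwable> sink) {
        try {
            action.run();
        } catch (RuntimeException e) {
            sink.accept(levelFor(e), e);
        }
    }
}
```

Because the check is a single instanceof BizException, a new business exception only needs to extend the base class to be logged at WARN instead of triggering alerts.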

Dynamic log level adjustment

We listed 6 levels above, but a logging system also has an overall threshold level. In business systems this is usually INFO: only logs at INFO or higher are written to the file. Consider the following two situations:

  • A bug in some business path is triggered and produces a huge number of ERROR logs, which not only hurts machine performance but also fires alert messages constantly.
  • We need to investigate a problem in some foundational middleware, but its logs are at DEBUG level. Normally we would have to redeploy with the log level changed to DEBUG just to investigate.

Both scenarios can be solved by adjusting the log level dynamically. When a flood of ERROR logs appears, that logging can be turned off immediately, before printing it causes further problems. When a DEBUG-level problem needs investigating, the output level can be switched to DEBUG directly.

Common ways to adjust the log level dynamically:

  • On Spring Boot 1.5 or later, add spring-boot-starter-actuator and change the log level through its HTTP loggers endpoint.
  • Use Arthas and change the level with an OGNL expression, as below (1be6f5c3 is the hash of the target class loader, and the expression assumes Logback is the logging backend):

        ognl -c 1be6f5c3 '@org.slf4j.LoggerFactory@getLogger("root").setLevel(@ch.qos.logback.classic.Level@DEBUG)'
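For the actuator option, the level change is a single HTTP POST. The sketch below builds that request with the JDK's java.net.http client (Java 11+). It assumes Spring Boot 2.x, where the endpoint lives under /actuator/loggers/{name} and must be exposed (for example with management.endpoints.web.exposure.include=loggers); Spring Boot 1.5 used a different path layout. The class name, host and port are placeholders.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper for changing a logger's level through Spring Boot Actuator (2.x paths assumed).
final class ActuatorLogLevel {
    // Builds POST {baseUrl}/actuator/loggers/{loggerName} with body {"configuredLevel":"<LEVEL>"}.
    static HttpRequest buildRequest(String baseUrl, String loggerName, String level) {
        String body = "{\"configuredLevel\":\"" + level + "\"}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/actuator/loggers/" + loggerName))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    // Sends the request and returns the HTTP status code (a 2xx code means the change was accepted).
    static int send(HttpRequest request) throws IOException, InterruptedException {
        HttpResponse<Void> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.discarding());
        return response.statusCode();
    }
}
```

For example, send(buildRequest("http://localhost:8080", "ROOT", "DEBUG")) switches the root logger to DEBUG, and posting "INFO" later switches it back, with no redeployment in between.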

2.2 Use placeholders sensibly

There is a classic interview question: can you spot the problem in the following code?

        LOG.info("a:" + a + "b" + b );

If our output level is WARN, this INFO log will never actually be printed, but the string concatenation inside it is still executed. To avoid this, many people add a level check first:

        if (LOG.isInfoEnabled()){
            LOG.info("a:" + a + "b" + b );
        }

But writing it this way is verbose, which is why the placeholder style exists:

        LOG.info("a:{}, b:{}", a, b);

This gets it done in one line. Note, though, that the logging library does nothing magical here: the placeholders are still replaced by scanning the message one placeholder at a time (in SLF4J this is done by its MessageFormatter helper class). For this reason, Vipshop's Java development manual recommends the concatenation style "a:" + a + "b:" + b for ERROR logs: ERROR logs are almost always printed, so the concatenation is never wasted. Personally, I think there is no need to overthink placeholder performance; just use placeholders everywhere. INFO logs typically make up about 99% of a system's logs and ERROR logs only about 1%, so the gain from concatenating ERROR messages is small, and mixing two styles invites misuse.
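The difference between concatenation and placeholders can be seen with a toy logger. MiniLogger and CountingArg are hypothetical classes, not SLF4J: MiniLogger only mimics the "{}" style and checks the level before formatting anything, so an argument's toString() is never called for a disabled level, while string concatenation renders it before the logger is even called.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical toy logger (not SLF4J) that mimics the "{}" placeholder style.
final class MiniLogger {
    enum Level { TRACE, DEBUG, INFO, WARN, ERROR }

    private final Level threshold;
    final StringBuilder out = new StringBuilder(); // stands in for the log file

    MiniLogger(Level threshold) { this.threshold = threshold; }

    boolean isEnabled(Level level) { return level.compareTo(threshold) >= 0; }

    // The level check happens first, so arguments are only rendered if INFO is enabled.
    void info(String pattern, Object... args) {
        if (!isEnabled(Level.INFO)) return;
        out.append(format(pattern, args)).append('\n');
    }

    // Replaces each "{}" in order with String.valueOf(arg); leftover "{}" stay as-is.
    static String format(String pattern, Object... args) {
        StringBuilder sb = new StringBuilder();
        int argIndex = 0, from = 0, at;
        while (argIndex < args.length && (at = pattern.indexOf("{}", from)) >= 0) {
            sb.append(pattern, from, at).append(String.valueOf(args[argIndex++]));
            from = at + 2;
        }
        return sb.append(pattern.substring(from)).toString();
    }
}

// An argument that counts how many times it has been rendered.
final class CountingArg {
    final AtomicInteger renders = new AtomicInteger();

    @Override
    public String toString() {
        renders.incrementAndGet();
        return "x";
    }
}
```

With a WARN threshold, info("a:{}, b:{}", arg, 2) leaves the render count at 0, whereas info("a:" + arg + "b" + 2) has already rendered arg once before info() runs; that wasted work is exactly what the isInfoEnabled() guard and the placeholder style avoid.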

Do not use placeholders for exceptions

        try{
            // do something
        }catch (Exception e){
            LOG.error("exception :{}", e);
        }

Many people write exception logging like the code above. When a placeholder is filled, the argument's toString() is used, and an exception's toString() contains only its class name and message, so the stack trace is lost and troubleshooting becomes much harder. (Frameworks and versions differ in how they treat a trailing exception that matches a placeholder, so do not rely on it.) So to stress the point: do not put exceptions into placeholders. Pass the exception as the last argument instead, for example LOG.error("exception", e), and the logger will automatically print its full stack trace.
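To see what is lost, compare two renderings of the same exception. This sketch uses only the JDK, and StackTraceDemo and its methods are hypothetical names: asPlaceholder shows roughly what a "{}" placeholder produces (the exception's toString()), and asStackTrace shows what a logger prints when the exception is passed as the last argument.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

final class StackTraceDemo {
    // Roughly what a "{}" placeholder renders: String.valueOf(e), i.e. e.toString(),
    // which is only the class name and message.
    static String asPlaceholder(Throwable t) {
        return String.valueOf(t);
    }

    // What a logger prints when the exception is the last argument: the full stack trace
    // (including any "Caused by:" sections).
    static String asStackTrace(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw));
        return sw.toString();
    }

    // Hypothetical failing business method.
    static void fail() {
        throw new IllegalStateException("boom");
    }
}
```

The first string is just "java.lang.IllegalStateException: boom"; only the second tells you which method and line threw it.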

2.3 Choose the log output mode sensibly

There are two log output modes, synchronous and asynchronous, and in business systems we generally choose asynchronous. In log4j2, asynchronous logging comes in two forms:

  • AsyncAppender: stores log events in an ArrayBlockingQueue, and a separate background thread writes them out.
  • AsyncLogger: stores log events in a Disruptor, and a separate background thread writes them out.

I wrote an earlier article on the Disruptor, "Explaining Disruptor in detail". It is a high-performance queue, and using it in log4j2 increases log throughput. In practice, though, we rarely use the Disruptor-based mode: when we used it for log output at Meituan, it maxed out the CPU, so it was banned in some places. Generally speaking, AsyncAppender is sufficient.
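The AsyncAppender model (a bounded ArrayBlockingQueue drained by one background thread) can be sketched in a few lines. MiniAsyncAppender is a hypothetical, simplified class, not log4j2's implementation: the business thread only enqueues, the writer thread does the slow write (here, adding to a list), and this sketch simply blocks the caller when the queue is full.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical, simplified async appender: callers enqueue, a single worker thread writes.
final class MiniAsyncAppender implements AutoCloseable {
    // Unique sentinel object (compared by identity) that tells the worker to stop.
    private static final String STOP = new String("stop");

    private final BlockingQueue<String> queue;
    private final List<String> written = new CopyOnWriteArrayList<>(); // stands in for the log file
    private final Thread worker;

    MiniAsyncAppender(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
        worker = new Thread(this::drain, "async-log-writer");
        worker.setDaemon(true);
        worker.start();
    }

    // Runs on the business thread: cheap unless the queue is full, in which case it blocks.
    void append(String line) throws InterruptedException {
        queue.put(line);
    }

    // Runs on the worker thread: takes lines in FIFO order and "writes" them.
    private void drain() {
        try {
            for (String line = queue.take(); line != STOP; line = queue.take()) {
                written.add(line);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    List<String> written() {
        return written;
    }

    // Enqueues the sentinel after all pending lines and waits for the worker to finish.
    @Override
    public void close() throws InterruptedException {
        queue.put(STOP);
        worker.join();
    }
}
```

The bounded queue is the key design choice: it caps memory use, and what happens when it fills up (block, as here, or drop events) becomes an explicit trade-off rather than an accident.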

2.4 Retain logs sensibly

Once the output mode is settled, we need to think about how logs are retained. Disk space is not unlimited, so expired logs have to be deleted. The Alibaba Java Development Manual explicitly requires log files to be kept for at least 15 days, and logs of users' sensitive operations and other important logs for 6 months. In log4j2, old logs can be deleted with a configuration like the following:

        <DefaultRolloverStrategy max="30">
            <Delete basePath="${LOG_HOME}/" maxDepth="2">
                <IfFileName glob="*.log" />
                <IfLastModified age="30d" />
            </Delete>
        </DefaultRolloverStrategy>

This configuration deletes *.log files older than 30 days, so logs are kept for at most 30 days. Alternatively, you can set up a scheduled task on the machine to delete them.

When there are many machines, the most primitive way to troubleshoot is to check each machine one by one. Later we started using polysh: you run a command on one machine and it is executed on the others automatically. That was still somewhat inconvenient, which is where ELK came in: Logstash collects all the logs and stores them in Elasticsearch, and Kibana provides a visual interface for analysis. Storing logs in Elasticsearch is therefore also a good option.
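For the scheduled-task alternative, the same rule (delete *.log files older than 30 days, at most two directory levels deep) can be written as a small Java job with java.nio.file. LogRetention.purge is a hypothetical helper, and how it gets scheduled (cron, a ScheduledExecutorService, and so on) is left open.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.attribute.FileTime;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Stream;

final class LogRetention {
    // Deletes regular files matching "*.log" under baseDir (up to maxDepth levels below it)
    // whose last-modified time is older than now - maxAge. Returns the deleted paths.
    static List<Path> purge(Path baseDir, int maxDepth, Duration maxAge, Instant now) throws IOException {
        FileTime cutoff = FileTime.from(now.minus(maxAge));
        PathMatcher logFile = FileSystems.getDefault().getPathMatcher("glob:*.log");
        List<Path> expired = new ArrayList<>();
        // Collect first, then delete, so the directory walk is not modified while it runs.
        try (Stream<Path> paths = Files.walk(baseDir, maxDepth)) {
            Iterator<Path> it = paths.iterator();
            while (it.hasNext()) {
                Path p = it.next();
                if (Files.isRegularFile(p)
                        && logFile.matches(p.getFileName())
                        && Files.getLastModifiedTime(p).compareTo(cutoff) < 0) {
                    expired.add(p);
                }
            }
        }
        for (Path p : expired) {
            Files.delete(p);
        }
        return expired;
    }
}
```

A daily call such as purge(logDir, 2, Duration.ofDays(30), Instant.now()), where logDir is your log directory, mirrors the Delete/IfFileName/IfLastModified rule in the log4j2 configuration.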

2.5 Output logs sensibly

If a system outputs large amounts of useless logs, its performance suffers, so we need to think about which logs actually help us rather than printing everything.

When troubleshooting through logs, we usually have to cross service boundaries, and logs from different services are hard to match up. We need something to link them together, and that is the traceId: with a single traceId we can retrieve the logs of the entire call chain, which makes troubleshooting much easier.
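In SLF4J-based setups this is usually done with MDC: put the traceId into MDC at the entry point (MDC.put("traceId", id)) and reference it in the log pattern with %X{traceId}. The sketch below models the same idea with a plain ThreadLocal so it runs without dependencies. TraceContext and its methods are hypothetical, and in a real system the incoming id would be read from a request header and forwarded on outgoing calls.

```java
import java.util.UUID;
import java.util.function.Supplier;

// Hypothetical stand-in for SLF4J's MDC: holds the current request's traceId per thread.
final class TraceContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Runs work with the caller's traceId (e.g. read from an incoming request header),
    // or a freshly generated one if there is none, and restores the previous value afterwards.
    static <T> T withTrace(String incomingTraceId, Supplier<T> work) {
        String previous = TRACE_ID.get();
        String id = (incomingTraceId != null) ? incomingTraceId : UUID.randomUUID().toString().replace("-", "");
        TRACE_ID.set(id);
        try {
            return work.get();
        } finally {
            if (previous == null) {
                TRACE_ID.remove();
            } else {
                TRACE_ID.set(previous);
            }
        }
    }

    static String current() {
        return TRACE_ID.get();
    }

    // Prefixes a message with the traceId, as a "[%X{traceId}] %msg"-style pattern would.
    static String line(String message) {
        return "[" + current() + "] " + message;
    }
}
```

Every log line written while handling the same request then carries the same id, so searching for it (for example in Kibana) returns the whole call chain, provided each service forwards the id on its outgoing calls.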

2.6 Keep sensitive information out of logs

In 2018, Facebook suffered a data leak, and suddenly the whole Internet began paying attention to leaks of sensitive information. Logging systems are an easy place for such leaks, for example of users' names and mobile phone numbers. Anything printed in logs can easily be misappropriated by criminals, so pay special attention to sensitive information when logging. I have also written an earlier article on log desensitization in detail: "Teach you how to design log desensitization plugin".
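A minimal masking helper, as a sketch: Desensitizer is a hypothetical class, and the phone pattern assumes 11-digit mobile numbers starting with 1 (the mainland China format). A real desensitization plugin usually hooks into the logging layout so business code cannot forget to call it; this only illustrates the masking rules.

```java
import java.util.regex.Pattern;

final class Desensitizer {
    // Assumed format: 11-digit mobile number starting with 1, not part of a longer digit run.
    private static final Pattern MOBILE = Pattern.compile("(?<!\\d)(1\\d{2})\\d{4}(\\d{4})(?!\\d)");

    // Keeps the first 3 and last 4 digits: 13812345678 -> 138****5678.
    static String maskMobile(String text) {
        return MOBILE.matcher(text).replaceAll("$1****$2");
    }

    // Keeps the first character of a name and masks the rest: "Alice" -> "A****".
    static String maskName(String name) {
        if (name == null || name.isEmpty()) {
            return name;
        }
        int firstEnd = name.offsetByCodePoints(0, 1);
        return name.substring(0, firstEnd) + "*".repeat(name.codePointCount(firstEnd, name.length()));
    }
}
```

Masking before the value reaches the logger, e.g. LOG.info("user phone: {}", Desensitizer.maskMobile(phone)), keeps the raw number out of the log file; the lookarounds stop longer digit runs such as order numbers from being mangled.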

2.7 Split logs sensibly

Many developers put all logs in the same file, which makes searching them very inconvenient. We can split logs into multiple files, for example one independent file per middleware such as http, rpc and mq, which makes it easier to narrow down and search for a specific problem.
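With Logback or log4j2 this is normally configured in XML by giving each logger name its own appender. To keep the example dependency-free, the sketch below swaps in the JDK's java.util.logging instead: each hypothetical middleware logger (http, rpc, mq, ...) gets its own FileHandler and stops propagating to the root logger. SplitLogs and the "middleware." name prefix are assumptions.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.logging.FileHandler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

final class SplitLogs {
    // Returns a logger for one middleware ("http", "rpc", "mq", ...) that writes only to <dir>/<name>.log.
    // Callers should keep a strong reference (e.g. a static final field): java.util.logging
    // holds loggers weakly, so an unreferenced logger may be discarded along with its handler.
    static Logger forMiddleware(String name, Path dir) throws IOException {
        Logger logger = Logger.getLogger("middleware." + name);
        FileHandler handler = new FileHandler(dir.resolve(name + ".log").toString(), true); // append mode
        handler.setFormatter(new SimpleFormatter());
        logger.addHandler(handler);
        logger.setUseParentHandlers(false); // keep these lines out of the root (console) log
        return logger;
    }
}
```

Searching an RPC problem then means opening rpc.log only, without MQ and HTTP noise mixed in.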

2.8 Third-party tools

Although we have covered many logging conventions, we cannot always get exactly the logs we want. For example, some methods may have no logs at all, yet a problem in them still needs investigating. In that case we can turn to third-party tools such as Arthas, whose commands like watch and trace can stand in for missing logs. But third-party tools are not all-powerful: they can only observe data from the moment you start using them, and historical data still has to come from our logging system.

Summary

Of course, optimizing logging in practice goes beyond the points above; many more scenarios need to be optimized in light of the actual business. I hope everyone can make good use of logs, so that no problem in the world is too hard to solve!
