With trace-enhanced logging, I'm no longer afraid of the team lead reporting in the group chat that one of my features is throwing errors.

At work, one of the things most of us dread is being @-mentioned in the work group chat: "feature XX is throwing an error..."

Then you have to log in to the server and dig through the logs. A small log volume is manageable, but with a large one it becomes tedious and hard to locate the key lines directly.

After searching back and forth you finally find the error message, but you have no idea what the parameters were at the time, and the problem is hard to reproduce.

And once it is fixed you still have to write an incident report; there goes your whole day.

To solve these pain points, you need to do the following:

  • Log collection
  • Exception alerting
  • Add trace information to logs
  • Return the traceId in API responses
  • Log the failing method's parameters on exception
  • Support a debug mode

Log collection

The first problem to solve is centralized log management. Otherwise, every time something breaks you have to hunt for error messages across multiple services, which is far too inefficient. You could cobble something together with a tool like Ansible, but the best option is to collect the logs centrally and search and view them through a web page.

The standard solution for log collection is the ELK stack, so I won't go into it here. The cloud service we use makes collection even easier: a few clicks on the console and it is set up.

Add trace information to logs

Adding trace information to the logs takes two steps: first the system needs distributed tracing, and then the trace context is woven into the log output.

I use Spring Cloud Sleuth, mainly because Sleuth supports many open-source frameworks and integrates with logging frameworks such as Logback, which makes it very convenient.

Sleuth's default enhanced log format is as follows:

[${spring.zipkin.service.name:${spring.application.name:-}},%X{X-B3-TraceId:-},%X{X-B3-SpanId:-},%X{X-Span-Export:-}]

These fields are the service name, the trace ID, the span ID, and whether the span is exported to Zipkin. The most important one for logging is the traceId: with it, the logs of all systems involved in a request can be strung together.

We can also extend this ourselves and add other information to the log, for example (a sketch of how these fields can be populated follows the list):

%X{X-REST-API:-},%X{X-RPC-SERVICE:-},%X{X-ORIGIN-INFO:-},%X{X-USER-ID:-},%X{X-BIZ-NAME:-},%X{X-BIZ-ID:-}
  • X-REST-API: the entry API, propagated across the whole call chain
  • X-RPC-SERVICE: the entry RPC service, added at the entry point of each service
  • X-ORIGIN-INFO: origin information (caller application name : IP : service name)
  • X-USER-ID: the user ID, propagated across the whole call chain
  • X-BIZ-NAME: the business name, propagated across the whole call chain, can be overridden inside the application
  • X-BIZ-ID: the business ID, propagated across the whole call chain, can be overridden inside the application
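
A minimal sketch of how such custom fields can be fed into the log pattern, assuming SLF4J's MDC and a plain servlet filter (the filter and its fallback logic are illustrative, not Sleuth's own propagation mechanism; forwarding these values to downstream services, e.g. via Sleuth baggage or RPC attachments, is omitted here):

import org.slf4j.MDC;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;

// Copies the custom trace headers into the MDC so a logback pattern such as
// %X{X-REST-API:-},%X{X-USER-ID:-} can render them on every log line.
@Component
public class TraceContextFilter extends OncePerRequestFilter {

    private static final String[] HEADERS = {
            "X-REST-API", "X-RPC-SERVICE", "X-ORIGIN-INFO",
            "X-USER-ID", "X-BIZ-NAME", "X-BIZ-ID"
    };

    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        try {
            for (String header : HEADERS) {
                String value = request.getHeader(header);
                if (value != null) {
                    MDC.put(header, value);
                }
            }
            // Entry API: fall back to the current request path when no upstream value was passed.
            if (MDC.get("X-REST-API") == null) {
                MDC.put("X-REST-API", request.getRequestURI());
            }
            chain.doFilter(request, response);
        } finally {
            // Clean up so values never leak into the next request handled by this thread.
            for (String header : HEADERS) {
                MDC.remove(header);
            }
        }
    }
}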

With this extended information you can tell directly from the log which entry API the current request came in through and which services it has passed through.

If I were the owner of the order service, the log would tell me during troubleshooting which upstream system and which interface call triggered the current error.

The log also carries the user information, so you know which user made the request.

BIZ-ID and BIZ-NAME help with business-level troubleshooting. For example, once an order has been placed you know its order ID, so you can write it into the log: BIZ-NAME=order, BIZ-ID=20102121212121.

Order-related operations such as payment, shipping, and refunds all carry the order ID in their logs, so when you need to troubleshoot you can pull up all log entries for the entire order by its ID, provided the information was actually logged.
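
As a small, hedged illustration of overriding BIZ-NAME and BIZ-ID inside the application once the order ID is known (assuming the fields live in the MDC as set up earlier; the class and method names are made up):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class OrderBizTagging {

    private static final Logger log = LoggerFactory.getLogger(OrderBizTagging.class);

    // Called right after the order record is created.
    public void tagOrder(long orderId) {
        // Overwrite the business fields so every later log line of this request carries the order ID.
        MDC.put("X-BIZ-NAME", "order");
        MDC.put("X-BIZ-ID", String.valueOf(orderId));
        log.info("order created");
    }
}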

Exception alerting

Developers should not have to rely on user feedback to learn that something is wrong; they should know the moment it happens. That is why exception alerting is a must.

Our applications generally fall into three categories: service applications, job applications, and asynchronous consumer applications.

For service applications we can raise the alert in the unified exception handler. Job applications can alert at the unified scheduling entry point, and the same goes for asynchronous consumers.

Alerts can be pushed through a message queue, or you can define a log format, write the events to the logs, collect them on the log platform, and configure alerting rules there.

Include the traceId in the alert, so that when a problem shows up you can search the log platform by traceId and see every log entry related to it. This is a huge help for troubleshooting, provided you have logged the key information.
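
For service applications, a minimal sketch of alerting from the unified exception handler, assuming Sleuth has put X-B3-TraceId into the MDC and assuming a hypothetical AlertClient (IM webhook, SMS, or similar):

import org.slf4j.MDC;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

// Hypothetical alert channel; in practice this could post to a chat webhook or a paging system.
interface AlertClient {
    void send(String message);
}

// Unified exception handling for service applications: alert with the traceId, then respond.
@RestControllerAdvice
public class GlobalExceptionHandler {

    private final AlertClient alertClient;

    public GlobalExceptionHandler(AlertClient alertClient) {
        this.alertClient = alertClient;
    }

    @ExceptionHandler(Exception.class)
    public ResponseEntity<String> handle(Exception ex) {
        String traceId = MDC.get("X-B3-TraceId"); // put into the MDC by Sleuth
        // The alert carries the traceId so whoever receives it can search the log platform directly.
        alertClient.send("Service error, traceId=" + traceId + ", message=" + ex.getMessage());
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
                .body("internal error, traceId=" + traceId);
    }
}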

Return the traceId in API responses

Responses can be customized uniformly through ResponseBodyAdvice, which lets us attach the traceId to every response. When a problem occurs, you can take the traceId from the response and search the logs on the log platform directly.
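
A minimal sketch of that approach: here the traceId is attached as a response header (the header name X-Trace-Id is our own choice; the same hook could instead set a field on a unified response wrapper):

import org.slf4j.MDC;
import org.springframework.core.MethodParameter;
import org.springframework.http.MediaType;
import org.springframework.http.converter.HttpMessageConverter;
import org.springframework.http.server.ServerHttpRequest;
import org.springframework.http.server.ServerHttpResponse;
import org.springframework.web.bind.annotation.RestControllerAdvice;
import org.springframework.web.servlet.mvc.method.annotation.ResponseBodyAdvice;

// Attaches the current traceId to every REST response.
@RestControllerAdvice
public class TraceIdResponseAdvice implements ResponseBodyAdvice<Object> {

    @Override
    public boolean supports(MethodParameter returnType,
                            Class<? extends HttpMessageConverter<?>> converterType) {
        return true; // apply to all controller responses
    }

    @Override
    public Object beforeBodyWrite(Object body, MethodParameter returnType, MediaType selectedContentType,
                                  Class<? extends HttpMessageConverter<?>> selectedConverterType,
                                  ServerHttpRequest request, ServerHttpResponse response) {
        String traceId = MDC.get("X-B3-TraceId"); // put into the MDC by Sleuth
        if (traceId != null) {
            response.getHeaders().add("X-Trace-Id", traceId);
        }
        return body;
    }
}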

Besides speeding up troubleshooting, the traceId also lets you look up the latency breakdown of that API call when doing performance work, provided the traceId matches the one in your APM system.

Log the failing method's parameters on exception

With the steps above, whenever an exception occurs we already have a traceId to look up the related error information, with no need to rummage through logs on multiple machines, which greatly speeds up problem solving.

But these steps only get us halfway. Say I receive an alert, go to the log platform, check the related logs, and find the line that threw the error.

At that point I can only guess why that line failed, because I don't know which parameters triggered the error. If the parameters of the failing method were written to the log at the moment the error occurred, it would be like preserving the scene of the incident, and fixing the problem becomes a matter of minutes.

There is no single way to implement this. The simplest is to write an Aspect that cuts across all business methods and records the parameters when a method throws an exception. Make sure this recording happens only on the exception path, otherwise the performance impact is significant.
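
A minimal sketch of such an aspect with Spring AOP; the pointcut expression and package name are illustrative:

import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.annotation.AfterThrowing;
import org.aspectj.lang.annotation.Aspect;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

import java.util.Arrays;

// Records a business method's arguments only when it throws,
// so the happy path pays no serialization cost.
@Aspect
@Component
public class MethodArgsOnExceptionAspect {

    private static final Logger log = LoggerFactory.getLogger(MethodArgsOnExceptionAspect.class);

    @AfterThrowing(pointcut = "execution(* com.xxx.biz.service..*(..))", throwing = "ex")
    public void logArgs(JoinPoint joinPoint, Throwable ex) {
        log.error("{} threw an exception, parameters: {}",
                joinPoint.getSignature().toShortString(),
                Arrays.toString(joinPoint.getArgs()),
                ex);
    }
}

In practice you might serialize the arguments to JSON instead of Arrays.toString, as in the sample output below.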

Example output:

com.xxx.biz.service.impl.GoodsSkuServiceImpl.createSku threw an exception, parameters: {"cspuId":1, stock:10, price:100}
Caused by: java.util.NoSuchElementException: No value present
	at java.util.Optional.get(Optional.java:135)
	at com.xxx.biz.service.impl.GoodsSkuServiceImpl.createSku(GoodsSkuServiceImpl.java:682)

Support a debug mode

Supporting a debug mode covers the scenarios where we can reproduce an error but want more than the parameters captured at the moment of the exception: we want the parameters and responses along the entire request chain. In other words, every method the request passes through should print its request and response data.

You can define a dedicated request header and send it when reproducing the problem; the common framework picks it up and propagates it along the whole chain, and the aspect then logs the parameters and results of each call.
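
One hedged way to wire this up inside a single service: a filter that recognizes a debug header (the name X-Debug-Trace is made up) and sets a flag in the MDC, plus an @Around aspect that logs parameters and responses only when that flag is present. Forwarding the header to downstream services (e.g. via Sleuth baggage or RPC attachments) is omitted here:

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.Arrays;

// Turns on debug mode for a single request when the caller sends X-Debug-Trace: true.
@Component
class DebugModeFilter extends OncePerRequestFilter {

    static final String DEBUG_FLAG = "X-Debug-Trace";

    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        try {
            if ("true".equalsIgnoreCase(request.getHeader(DEBUG_FLAG))) {
                MDC.put(DEBUG_FLAG, "true");
            }
            chain.doFilter(request, response);
        } finally {
            MDC.remove(DEBUG_FLAG);
        }
    }
}

// Logs parameters and return values of business methods, but only in debug mode.
@Aspect
@Component
class DebugTraceAspect {

    private static final Logger log = LoggerFactory.getLogger(DebugTraceAspect.class);

    @Around("execution(* com.xxx.biz..*(..))") // illustrative pointcut
    public Object trace(ProceedingJoinPoint pjp) throws Throwable {
        if (!"true".equals(MDC.get(DebugModeFilter.DEBUG_FLAG))) {
            return pjp.proceed(); // normal requests pay no extra cost
        }
        String method = pjp.getSignature().toShortString();
        log.info("{} parameters: {}", method, Arrays.toString(pjp.getArgs()));
        Object result = pjp.proceed();
        log.info("{} response: {}", method, result);
        return result;
    }
}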

Example output:

xxx.xxxController.makeOrder parameters: xxx
xxx.xxxRpcService.makeOrder parameters: xxx
xxx.xxxStockRpcService.lockStock parameters: xxx
xxx.xxxStockRpcService.lockStock response: xxx
xxx.xxxRpcService.makeOrder response: xxx
xxx.xxxController.makeOrder response: xxx

Origin blog.csdn.net/linuxguitu/article/details/112866504