Application monitoring with instrumentation (buried points)

In metrics monitoring, there are two typical approaches at the application and business levels: instrumenting the application (burying points in it), or analyzing logs and extracting metrics from them. Instrumentation performs better and is more flexible, but it is somewhat intrusive to the application. Log analysis is less intrusive, but because the pipeline is longer and requires text parsing, its performance is worse and it needs more computing power.

A so-called buried point is an SDK (a lib library) embedded in the application; the application then calls methods provided by the SDK at key points in the code to record various key metrics. For example, if an HTTP service exposes 10 interfaces, the number of milliseconds each interface takes to process a request can be recorded as a metric.

The core reason for using an SDK is that it encapsulates common computation logic for us. For example, a Summary-type metric can provide the 99th percentile latency of an interface. Without an SDK, what would we have to do? Every time the interface serves a request, record the latency and store it in an in-memory data structure; wait for an interval, say 10 seconds; sort all the latency samples collected in that window; take the value at the 99th percentile; and finally call the Pushgateway interface to push that value. Quite troublesome, isn't it?
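The do-it-yourself percentile computation described above can be sketched in a few lines. This is only an illustration of the logic an SDK hides, using the simple nearest-rank method (real SDKs often use streaming quantile estimators instead):

```python
import math

def p99(latencies_ms):
    """Compute the 99th percentile of one window of latency samples.
    Nearest-rank method: sort the samples and pick the value at the
    99% rank. This is the bookkeeping a metrics SDK does for us."""
    if not latencies_ms:
        return None
    ordered = sorted(latencies_ms)
    # ceil(0.99 * N) gives a 1-based rank; convert to a 0-based index
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

# Simulate one 10-second window of 100 request latencies (1..100 ms).
window = list(range(1, 101))
print(p99(window))  # -> 99
```

Without an SDK, the application would also have to schedule this per-window, reset the buffer, and push the result — which is exactly the "troublesome" part.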

The SDK exists to handle this common logic. All the application needs to do is call an SDK method once it has the latency data, telling the SDK "here is another interface call, and here is its latency"; the SDK takes care of everything else.

The best-known cross-language instrumentation tools are StatsD and Prometheus. Of course, some languages have popular tools of their own, such as Micrometer in the Java ecosystem, but most companies use multiple languages, so these cross-language solutions tend to see wider use.

A defining feature of StatsD is its use of UDP as the transport protocol. Most of the computation logic is moved into the StatsD Server, so the work done at the SDK level is very light.

Communication between the StatsD SDK and the StatsD Server uses UDP. UDP is fire-and-forget: no connection needs to be established, so even if the StatsD Server goes down, the application is unaffected. Likewise, the logic for computing latency distribution buckets runs inside the StatsD Server rather than the application. The whole StatsD design is therefore very lightweight and has essentially no impact on the application.
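The fire-and-forget push is simple enough to sketch with a raw socket. The line format below (`<metric>:<value>|<type>`) is the plaintext StatsD protocol, and 8125 is StatsD's conventional default port; the metric name is made up for illustration:

```python
import socket

def statsd_line(metric, value, mtype):
    """Render one metric in the plaintext StatsD protocol:
    <metric>:<value>|<type>. Timers use type "ms", counters "c"."""
    return f"{metric}:{value}|{mtype}"

def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget push over UDP. No connection, no response:
    if the StatsD Server is down, the datagram is simply lost and
    the application never blocks or errors out."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("utf-8"), (host, port))
    finally:
        sock.close()

# Record that one call to a (hypothetical) user interface took 12 ms.
send_metric(statsd_line("webapi.user.latency", 12, "ms"))
```

Note that `sendto` succeeds even when nothing is listening — this is precisely why a dead StatsD Server cannot hurt the application.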

Since StatsD is so well known, many collectors have implemented the StatsD protocol, such as Telegraf and Datadog-agent. That means the StatsD Server in the diagram can be replaced with Telegraf or Datadog-agent, so fewer processes need to be deployed and a single collector binary covers everything. Taking Telegraf as an example, after the replacement the architecture looks like this.

Prometheus instrumentation is very similar to StatsD's: metrics such as request count and latency are likewise recorded by calling an SDK method after request processing completes. However, adding those few lines of code to every handler would be redundant; it is better to implement the cross-cutting logic once via AOP, which is exactly what Nightingale's Webapi module does.
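The AOP idea — time and count every handler in one place instead of repeating the bookkeeping inside each one — can be sketched with a decorator. This is not Nightingale's actual code (Webapi is written in Go and uses real SDK types); the in-memory dicts below merely stand in for SDK counters and histograms:

```python
import time
from collections import defaultdict

# Stand-ins for SDK-managed metrics (illustration only; a real SDK
# such as a Prometheus client library would own these structures).
request_count = defaultdict(int)
request_latency = defaultdict(list)

def instrumented(path, method):
    """Decorator playing the role of AOP advice: every wrapped handler
    is timed and counted without any metric code in its own body."""
    def wrap(handler):
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                return handler(*args, **kwargs)
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000
                request_count[(path, method)] += 1
                request_latency[(path, method)].append(elapsed_ms)
        return inner
    return wrap

@instrumented("/api/v1/users", "GET")
def list_users():
    # A hypothetical handler; only its decorator does the recording.
    return ["alice", "bob"]

list_users()
print(request_count[("/api/v1/users", "GET")])  # -> 1
```

In a web framework the same effect is usually achieved with middleware that wraps every route automatically, so handlers need no decorator at all.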

Webapi's job is to provide a series of HTTP interfaces for JavaScript to call. We want to monitor the call volume, success rate, and latency of these interfaces. Before instrumenting, let's plan the labels first. We plan 4 labels for each HTTP interface:

  • service: the service name, which must be globally unique so the service can be distinguished from others.
  • code: the HTTP status code returned. From it we can see the ratio of 4xx and 5xx responses and compute the success rate.
  • path: the interface path, such as /api/v1/users. Sometimes URL parameters are embedded in the path: /api/v1/user/23 and /api/v1/user/12 request the user information for ids 23 and 12. In that case the raw URL cannot be used directly as the label value, or the metric's granularity would be far too fine; the label value should instead be /api/v1/user/:id.
  • method: the HTTP method: GET, POST, DELETE, PUT, etc.
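The path-normalization rule and the four planned labels can be sketched as follows. The regex here is a simplified assumption (any purely numeric path segment is treated as an :id parameter), and the metric and service names are made up for illustration; the output line follows the Prometheus text exposition label syntax:

```python
import re

def normalize_path(path):
    """Collapse URL parameters so /api/v1/user/23 and /api/v1/user/12
    map to the same label value, keeping series cardinality bounded.
    Simplifying assumption: numeric segments are :id parameters."""
    return re.sub(r"/\d+(?=/|$)", "/:id", path)

def labeled_series(name, service, code, path, method):
    """Render a series name with the four planned labels in
    Prometheus text exposition label syntax."""
    return (f'{name}{{service="{service}",code="{code}",'
            f'path="{normalize_path(path)}",method="{method}"}}')

print(labeled_series("http_request_duration_ms", "webapi", 200,
                     "/api/v1/user/23", "GET"))
# -> http_request_duration_ms{service="webapi",code="200",path="/api/v1/user/:id",method="GET"}
```

Without the normalization step, every distinct user id would create a brand-new time series — exactly the cardinality explosion the :id placeholder avoids.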

With StatsD-style instrumentation, the data is pushed over UDP to Telegraf, and Telegraf pushes it on to the back-end monitoring system. With Prometheus-style instrumentation, the application instead exposes a /metrics interface and waits for the monitoring system to pull it. If the application is deployed on a physical or virtual machine, the local monitoring agent can pull it directly. If the application is deployed in a Kubernetes Pod, there are two ways to pull the data: sidecar mode and central service discovery mode. The diagram below shows the sidecar mode.
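In pull mode, all the application has to do is serve plain text on /metrics; the agent decides when to scrape. A minimal sketch using only the standard library (the counter value and the common exporter port 9100 are illustrative; a real application would use a Prometheus client library):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics(counters):
    """Render counters in Prometheus text exposition format, one
    'name value' pair per line. The scraping agent (Categraf,
    Telegraf, or Prometheus itself) parses this text on each pull."""
    lines = [f"{name} {value}" for name, value in sorted(counters.items())]
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    counters = {"http_requests_total": 42}  # illustrative value

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.counters).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # HTTPServer(("", 9100), MetricsHandler).serve_forever()  # blocks; port is an assumption
    print(render_metrics(MetricsHandler.counters))
```

Because the endpoint is just an HTTP handler inside the application, per-Pod concerns such as authentication or metric filtering can be added here without touching any other Pod — which is the flexibility discussed next.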

Pod1 on the left has two containers: App is instrumented with StatsD and pushes to Telegraf over UDP; after receiving the data, Telegraf performs secondary computation and pushes the results to the monitoring server. Pod2 on the right also has two containers: App is instrumented with the Prometheus SDK and exposes the /metrics interface; Categraf pulls data through this interface and then pushes it to the monitoring server.

The advantage of this approach is flexibility: each Pod decides for itself how its application is handled, and even adding authentication, metric filtering, or label-enrichment logic to the /metrics interface affects no other Pod. The data is pushed to the monitoring server, and the components that receive it can be built as a stateless cluster behind a load balancer, so the overall architecture is simple and scales well. The drawback is equally obvious: every Pod carries a sidecar agent, which wastes resources.


This article is a study note for Day 12 of August. The content comes from the Geek Time course "Operation and Maintenance Monitoring System Practical Notes", which I recommend.


Origin blog.csdn.net/key_3_feng/article/details/132241512