eBay's Deng Ming: the design of metrics in dubbo-go

I recently set out to implement metrics functionality in apache/dubbo-go (hereinafter dubbo-go) similar to what Dubbo offers, so I spent a good deal of time studying how metrics are implemented in Dubbo today. That part actually lives in a separate project, namely metrics.

Overall, Dubbo's metrics is a very good module from design to implementation, and in theory most Java projects can use it directly. However, because it balances several non-functional concerns such as performance and extensibility, the code can be hard to get into at first glance.

This article discusses the design of the metrics module in dubbo-go at the level of its main concepts and abstractions. In fact, it is the design of metrics in Dubbo, because I simply ported the relevant parts of Dubbo into dubbo-go.

Metrics support in dubbo-go has only just started, with the first PR.

The overall design

Metric

To understand the design of metrics, we first need to understand what data we want to collect. It is easy to list the indicators we care about in the RPC domain, such as the number of calls and the response time of each service. In more detail, there are also the response time distribution, the average response time, the 99.9th percentile (the "999 line"), and so on.

However, the list above divides the data by its content. The abstractions in metrics abandon that division and instead classify data by a combination of its characteristics and its form.

This classification is easy to find in the source code.
[Image: screenshot from the metrics source code]

metrics defines the Metric interface as the top-level abstraction for all data.

In Dubbo, its more important sub-interfaces are:

[Image: the sub-interfaces of Metric in Dubbo]

To make them easier to understand, I have copied the descriptions of what these interfaces are for:

  • Gauge: an instantaneous measurement that reflects a momentary value and is not additive, for example the current number of threads in the JVM;
  • Counter: a counter-type metric, suitable for recording totals such as the number of calls or transactions;
  • Histogram: a distribution metric; for example, it can track the response time of an interface and show the response-time ranges within which 50%, 70%, and 90% of requests fall;
  • Meter: a metric that measures throughput over a period of time, for example QPS over the last one, five, and fifteen minutes;
  • Timer: equivalent to a combination of Meter and Histogram; it tracks both the QPS and the execution-time distribution of a piece of code or a method.

Currently dubbo-go has implemented only FastCompass, which is also a sub-interface of Metric:

[Image: the FastCompass interface]

This interface does something very simple: it collects the number of executions and the response time of each subCategory within a period of time. subCategory is a fairly broad concept; in both Dubbo and dubbo-go, a typical subCategory is a service.

The key point of this design is the angle from which the data is abstracted.

When developing this kind of data collection system or feature, many people fall most easily into abstracting the data by its content, for example an interface whose methods return the number of calls to a service or its average response time.

Such an abstraction is not unworkable, and in a simple system it is even quite convenient. It is just much weaker in generality and extensibility.

MetricManager

Once Metric is defined, it is natural to want something that manages these Metric instances. That is MetricManager, which corresponds to the IMetricManager interface in Dubbo.

The MetricManager interface in dubbo-go is currently very simple:

[Image: the MetricManager interface in dubbo-go]

In essence, every sub-interface of Metric mentioned earlier can be obtained from this MetricManager. It is the single external entry point.

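So whether you are reporting collected data or building a feature that uses it, the most important step is to obtain a MetricManager instance. For example, the Prometheus integration we have been developing recently obtains the MetricManager instance, gets the FastCompass instance from it, and then collects the data:

[Image: code of the Prometheus integration collecting data from FastCompass]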

MetricRegistry

MetricRegistry is an abstraction over a collection of Metric instances. The default implementation of MetricManager uses MetricRegistry to manage them:

[Image: code of the default MetricManager implementation]

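So in essence it provides methods for registering Metric instances and fetching them back later.

This raises a question: why is there still a MetricRegistry when there is already a MetricManager? Don't their responsibilities overlap?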

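The answer has roughly two parts:
1. Besides managing all Metric instances, MetricManager takes on extra responsibilities, typically IsEnabled. In fact, we will give it responsibility for lifecycle management in the future; in Dubbo, for example, the interface also has a clear method;
2. metrics also has the concept of a group, which can only be managed by MetricManager. At the very least, handing it to MetricRegistry would be inappropriate.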

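The group concept in metrics is simple. For example, data collected inside the Dubbo framework all belongs to the Dubbo group. If I want to separate out data collected outside the framework, such as pure business data, I can use a business group. Likewise, data collected about the machine itself can be placed under the system group.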

So the relationship between MetricManager and MetricRegistry is:

[Image: diagram of the relationship between MetricManager and MetricRegistry]
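One way to picture this relationship in code is a manager that holds one registry per group. The sketch below is an assumption-laden illustration rather than the dubbo-go source: the names manager, registry, counter, GetCounter, and the "calls" metric name are all hypothetical, and only a counter is modeled.

```go
package main

import (
	"fmt"
	"sync"
)

// Metric is the top-level abstraction; counter is the only kind modeled here.
type Metric interface{}

// counter is a deliberately minimal metric (not goroutine-safe).
type counter struct{ n int64 }

func (c *counter) Inc() { c.n++ }

// registry is a plain collection of metrics keyed by name.
type registry struct {
	metrics map[string]Metric
}

// manager owns one registry per group plus the manager-level switch.
type manager struct {
	mu         sync.Mutex
	enabled    bool
	registries map[string]*registry
}

func newManager() *manager {
	return &manager{enabled: true, registries: make(map[string]*registry)}
}

// IsEnabled is an example of a responsibility that belongs to the
// manager rather than to any single registry.
func (m *manager) IsEnabled() bool { return m.enabled }

// GetCounter returns the counter registered under group and name,
// creating the group's registry and the counter on first use.
func (m *manager) GetCounter(group, name string) *counter {
	m.mu.Lock()
	defer m.mu.Unlock()
	r, ok := m.registries[group]
	if !ok {
		r = &registry{metrics: make(map[string]Metric)}
		m.registries[group] = r
	}
	if existing, ok := r.metrics[name].(*counter); ok {
		return existing
	}
	c := &counter{}
	r.metrics[name] = c
	return c
}

func main() {
	m := newManager()
	m.GetCounter("dubbo", "calls").Inc()
	m.GetCounter("business", "calls").Inc()
	m.GetCounter("business", "calls").Inc()
	fmt.Println(m.GetCounter("dubbo", "calls").n, m.GetCounter("business", "calls").n)
}
```

The same metric name can exist in several groups without clashing, because the group is resolved by the manager before the registry is ever consulted.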

Clock

The Clock abstraction looks useless at first glance, but on a second look it turns out to be well designed. Clock has only two methods:

[Image: the Clock interface]

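One returns a timestamp, and the other returns the time period (Tick). For example, data is often collected once a minute, so you need to know which period you are currently in. Clock provides this abstraction.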

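When implementing their own metrics frameworks, most people use the system clock directly, that is, the system timestamp. As a result, every Metric has to deal with clock issues on its own when collecting or reporting data.

This makes it hard to keep the clocks of different Metric instances in sync. For example, Metric1 might collect over the current minute, while Metric2 collects from the thirtieth second of the current minute to the thirtieth second of the next. Both collect once a minute, but their periods do not line up.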

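Another interesting point is that the Clock abstraction frees us from having to use real-world timestamps. For example, one could design a Clock implementation based on CPU running time.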

Example

Let's use the content of this PR to illustrate the design.

This PR implements metricsFilter in dubbo-go. It mainly collects the number of calls and the response time, and its core is:

[Image: core code of metricsFilter]

report simply reports the metrics to MetricManager:

[Image: code of the report function]

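This shows that collecting any data also starts with obtaining the MetricManager instance.

The FastCompass implementation stores the service invoked in this call together with its response time, and the data is retrieved later when needed.

"When needed" usually means when reporting to a monitoring system, such as reporting to Prometheus as mentioned earlier.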

So the flow can be expressed abstractly as:

[Image: diagram of the collection and reporting flow]

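This is a broader abstraction. It means that besides collecting data in metricsFilter, we can also collect data from our own business code. For example, FastCompass can equally be used to measure the execution time of a piece of code.

Besides Prometheus, users whose companies have their own monitoring framework can implement their own reporting logic. All the data to report can be obtained simply by getting the MetricManager instance.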

Summary

In essence, the whole of metrics can be seen as one huge provider-consumer model.

Different data is collected in different places and at different points in time. Some readers get confused about this when reading the source code: at what point in time is each piece of data collected?

There are only two kinds of collection timing:
1. Real-time collection. The metricsFilter example above is one: the data is collected as soon as a call finishes;
2. Collection on demand, as with Prometheus. Each time Prometheus triggers the collect method, it gathers and reports the data from each metric (for example Meter and Gauge). This can be called timed collection.
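The difference between the two timings can be shown with a toy example. The names source, OnCall, Collect, and queueDepth are hypothetical and chosen only for illustration; they are not the Prometheus client API or the dubbo-go API.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// source holds one metric of each timing style.
type source struct {
	calls      int64          // updated in real time, once per call
	queueDepth func() float64 // sampled only when Collect runs
}

// OnCall is the real-time path: it runs when a call finishes.
func (s *source) OnCall() { atomic.AddInt64(&s.calls, 1) }

// Collect is the timed path: a reporter calls it on its own schedule.
func (s *source) Collect() map[string]float64 {
	return map[string]float64{
		"calls":       float64(atomic.LoadInt64(&s.calls)),
		"queue_depth": s.queueDepth(),
	}
}

func main() {
	depth := 3.0
	s := &source{queueDepth: func() float64 { return depth }}
	s.OnCall()
	s.OnCall()
	depth = 7 // the gauge changes, but nothing is recorded until Collect
	fmt.Println(s.Collect())
}
```

The counter has already absorbed every call by the time Collect runs, while the gauge-style value is read only at that moment, so intermediate values between two collections are never seen.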

Dubbo collects a great deal of data:

[Image: list of the data collected in Dubbo]

I will not go into the specific implementations here; interested readers can look at the source. We will continue to implement these in dubbo-go as well, and you are welcome to keep following the project or to contribute code.

About the author: Deng Ming graduated from Nanjing University and works in the Payment department at eBay, where he is responsible for development of the refund business.

Origin yq.aliyun.com/articles/741770