Metric types

The previous section introduced the underlying data model of Prometheus. In the storage implementation, all monitoring samples are kept as time series in the Prometheus TSDB (time-series database), and the metric behind each time series is uniquely identified by its name and label set.

From the storage point of view all metrics are the same, but they differ subtly depending on the scenario. For example, the node_load1 metric returned by Node Exporter reflects the current system load, so its sample values keep changing over time. The samples of node_cpu are different: the value increases continuously because it reflects the cumulative CPU usage time. In theory, as long as the system does not shut down, this value keeps growing without bound.

To help users understand and distinguish these different kinds of metrics, Prometheus defines four metric types: Counter, Gauge, Histogram, and Summary.

The sample data returned by an exporter also includes a type annotation for each metric. For example:

# HELP node_cpu Seconds the cpus spent in each mode.
# TYPE node_cpu counter
node_cpu{cpu="cpu0",mode="idle"} 362812.7890625
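
As a rough illustration of how a consumer might read these annotations, the sketch below extracts the metric type from "# TYPE" lines. The function name parse_metric_types is hypothetical, and the parsing covers only this simple case, not the full exposition format.

```python
def parse_metric_types(exposition_text):
    """Return {metric_name: type} from '# TYPE <name> <type>' lines."""
    types = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        # A type annotation has exactly the shape: # TYPE <name> <type>
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            types[parts[2]] = parts[3]
    return types


# Sample text copied from the exporter output above.
sample = """\
# HELP node_cpu Seconds the cpus spent in each mode.
# TYPE node_cpu counter
node_cpu{cpu="cpu0",mode="idle"} 362812.7890625
"""

print(parse_metric_types(sample))
```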

Counter: a counter that only goes up

A Counter metric works like a real counter: it only increases (unless the system is reset). Common metrics such as http_requests_total and node_cpu are Counters. When defining a custom Counter, it is recommended to use _total as the name suffix.

A Counter is a simple but powerful tool. For example, an application can record how many times a certain event has occurred; because the data is stored as a time series, it is easy to see how the rate of these events changes over time. PromQL's built-in aggregation operations and functions let users analyze the data further:

For example, use the rate() function to get the growth rate of HTTP requests:

rate(http_requests_total[5m])
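
To make the idea behind rate() concrete, here is a simplified sketch that computes a per-second increase from raw counter samples and treats any drop in value as a counter reset. The helper simple_rate and the sample values are hypothetical, and the real PromQL function additionally extrapolates to the window boundaries, which this sketch does not.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, value) pairs, oldest first.
    Simplification: no extrapolation to the window edges.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # The counter dropped, so assume it was reset to zero and restarted.
            increase += value
        else:
            increase += value - prev
        prev = value
    return increase / elapsed


# Hypothetical http_requests_total samples scraped every 60 seconds,
# with a process restart between t=120 and t=180.
window = [(0, 100.0), (60, 160.0), (120, 220.0), (180, 10.0), (240, 70.0)]
print(simple_rate(window))
```

Handling the reset is what makes a rate over a Counter meaningful across restarts: a plain "last minus first" would give a negative number for this window.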

Query the top 10 HTTP addresses by request count in the current system:

topk(10, http_requests_total)
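
Conceptually, topk keeps the k series with the largest current values. A minimal sketch of that selection, using a hypothetical dictionary of current values keyed by a "path" label:

```python
import heapq


def topk(k, series):
    """Return the k (label, value) pairs with the largest values."""
    return heapq.nlargest(k, series.items(), key=lambda item: item[1])


# Hypothetical current values of http_requests_total per path.
current = {
    "/api/users": 1520.0,
    "/": 9800.0,
    "/login": 430.0,
    "/api/orders": 2750.0,
}
print(topk(2, current))
```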

Gauge: a gauge that can go up and down

Unlike a Counter, a Gauge metric reflects the current state of the system, so its sample values can go up or down. Common metrics such as node_memory_MemFree (the amount of free memory on the host) and node_memory_MemAvailable (the amount of available memory) are Gauges.

With a Gauge metric, users can view the current state of the system directly:

node_memory_MemFree

For Gauge metrics, the PromQL built-in function delta() returns how much the samples changed over a period of time. For example, to compute the change in CPU temperature over two hours:

delta(cpu_temp_celsius{host="zeus"}[2h])

You can also use deriv() to compute the per-second derivative of the samples using simple linear regression, or use predict_linear() directly to predict the trend of the data. For example, to predict the free disk space 4 hours from now:

predict_linear(node_filesystem_free{job="node"}[1h], 4 * 3600)
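
The sketch below shows the least-squares idea behind deriv() and predict_linear(): fit a straight line through the samples, read off the slope, and extend the line into the future. The names linear_fit and predict and the sample data are hypothetical. As a simplification, the prediction is made relative to the last sample's timestamp rather than the query evaluation time.

```python
def linear_fit(samples):
    """Least-squares slope and intercept for (timestamp, value) pairs."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var  # change in value per second
    intercept = mean_v - slope * mean_t
    return slope, intercept


def predict(samples, seconds_ahead):
    """Extend the fitted line seconds_ahead past the last sample."""
    slope, intercept = linear_fit(samples)
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


# Hypothetical free-space samples over one hour, one every 10 minutes,
# shrinking by 1000 bytes per second.
free = [(t, 1_000_000_000 - 1000 * t) for t in range(0, 3601, 600)]

slope, _ = linear_fit(free)
print(slope)                    # per-second derivative, as deriv() would report
print(predict(free, 4 * 3600))  # estimated free bytes 4 hours later
```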

Analyzing data distribution with Histogram and Summary

In addition to Counter and Gauge, Prometheus also defines the Histogram and Summary metric types. Both are used mainly to collect statistics on, and analyze, the distribution of samples.

In most cases people tend to rely on the average of some quantitative metric, such as average CPU utilization or average page response time. The problem with this approach is obvious. Take the average response time of system API calls: if most API requests stay within a 100ms response time but a few requests take 5s, the average no longer reflects what most requests experience, and the response time of some web pages ends up far from the median. This phenomenon is known as the long-tail problem.
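
A small numeric illustration of this effect, using made-up latencies (95 requests at 100ms and 5 requests at 5s; the split is an assumption for the example):

```python
import statistics

# Hypothetical latencies in seconds: 95 fast requests and 5 slow ones.
latencies = [0.1] * 95 + [5.0] * 5

mean = statistics.mean(latencies)
median = statistics.median(latencies)

print(f"mean={mean:.3f}s median={median:.3f}s")
```

The mean comes out at more than three times the median even though 95% of the requests took exactly 100ms, which is why an average alone cannot show whether slowness is widespread or confined to the tail.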

To tell whether requests are slow on average or slow only in the long tail, the simplest approach is to group requests by latency range. For example, count how many requests took 0~10ms, how many took 10~20ms, and so on. This makes it quick to analyze why a system is slow. Histogram and Summary exist to solve exactly this kind of problem: with these metric types we can quickly understand the distribution of monitoring samples.
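
The grouping described above can be sketched in a few lines. The helper group_by_range, the 10ms width, and the observed latencies are all hypothetical:

```python
from collections import Counter


def group_by_range(latencies_ms, width_ms=10):
    """Count requests per [start, start + width_ms) latency range."""
    counts = Counter()
    for latency in latencies_ms:
        start = int(latency // width_ms) * width_ms
        counts[(start, start + width_ms)] += 1
    return dict(sorted(counts.items()))


# Hypothetical request latencies in milliseconds, with one slow outlier.
observed = [3, 7, 9, 12, 15, 18, 19, 24, 5000]
print(group_by_range(observed))
```

The single 5000ms request lands in its own range far from the others, so the long tail is visible at a glance instead of being folded into an average.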

For example, the metric prometheus_tsdb_wal_fsync_duration_seconds is of type Summary. It records how long Prometheus Server takes to process wal_fsync operations. Visiting the /metrics endpoint of Prometheus Server returns the following sample data:

# HELP prometheus_tsdb_wal_fsync_duration_seconds Duration of WAL fsync.
# TYPE prometheus_tsdb_wal_fsync_duration_seconds summary
prometheus_tsdb_wal_fsync_duration_seconds{quantile="0.5"} 0.012352463
prometheus_tsdb_wal_fsync_duration_seconds{quantile="0.9"} 0.014458005
prometheus_tsdb_wal_fsync_duration_seconds{quantile="0.99"} 0.017316173
prometheus_tsdb_wal_fsync_duration_seconds_sum 2.888716127000002
prometheus_tsdb_wal_fsync_duration_seconds_count 216

From the samples above we can see that Prometheus Server has performed 216 wal_fsync operations so far, taking 2.888716127000002s in total. The median (quantile=0.5) is 0.012352463s and the 0.9 quantile (quantile=0.9) is 0.014458005s.
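
The _sum and _count samples also give the mean duration, a value the quantiles do not show directly. A short calculation using the numbers copied from the output above:

```python
# Values copied from the Summary samples above.
fsync_sum = 2.888716127000002  # total seconds spent in wal_fsync
fsync_count = 216              # number of wal_fsync operations

average = fsync_sum / fsync_count
print(f"{average:.6f}s")
```

For this data the mean falls between the reported median and the 0.9 quantile.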

Among the sample data returned by Prometheus Server itself, we can also find prometheus_tsdb_compaction_chunk_range, a metric of type Histogram:

# HELP prometheus_tsdb_compaction_chunk_range Final time range of chunks on their first compaction
# TYPE prometheus_tsdb_compaction_chunk_range histogram
prometheus_tsdb_compaction_chunk_range_bucket{le="100"} 0
prometheus_tsdb_compaction_chunk_range_bucket{le="400"} 0
prometheus_tsdb_compaction_chunk_range_bucket{le="1600"} 0
prometheus_tsdb_compaction_chunk_range_bucket{le="6400"} 0
prometheus_tsdb_compaction_chunk_range_bucket{le="25600"} 0
prometheus_tsdb_compaction_chunk_range_bucket{le="102400"} 0
prometheus_tsdb_compaction_chunk_range_bucket{le="409600"} 0
prometheus_tsdb_compaction_chunk_range_bucket{le="1.6384e+06"} 260
prometheus_tsdb_compaction_chunk_range_bucket{le="6.5536e+06"} 780
prometheus_tsdb_compaction_chunk_range_bucket{le="2.62144e+07"} 780
prometheus_tsdb_compaction_chunk_range_bucket{le="+Inf"} 780
prometheus_tsdb_compaction_chunk_range_sum 1.1540798e+09
prometheus_tsdb_compaction_chunk_range_count 780

Like Summary metrics, Histogram samples also record the total number of observations (with the _count suffix) and the sum of their values (with the _sum suffix). The difference is that a Histogram directly exposes the number of samples in each interval, with the intervals defined by the le label. Each bucket counts the samples less than or equal to its le value, so the counts are cumulative.
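
Because the bucket counts are cumulative, the number of samples that fall inside one particular interval is the difference between adjacent buckets. The sketch below derives those per-interval counts from the output above; the helper name per_interval_counts is hypothetical.

```python
import math

# (le upper bound, cumulative count) pairs copied from the output above.
buckets = [
    (100, 0), (400, 0), (1600, 0), (6400, 0), (25600, 0),
    (102400, 0), (409600, 0), (1.6384e6, 260), (6.5536e6, 780),
    (2.62144e7, 780), (math.inf, 780),
]


def per_interval_counts(cumulative):
    """Turn cumulative bucket counts into counts per (lower, upper] interval."""
    result = []
    lower, previous = -math.inf, 0
    for upper, count in cumulative:
        result.append(((lower, upper), count - previous))
        lower, previous = upper, count
    return result


# Print only the intervals that actually contain samples.
for interval, count in per_interval_counts(buckets):
    if count > 0:
        print(interval, count)
```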

For Histogram metrics, we can also compute quantiles with the histogram_quantile() function. The difference is that a Histogram's quantiles are computed on the server side by histogram_quantile(), while a Summary's quantiles are computed directly on the client side. As a result, Summary performs better when quantiles are queried through PromQL, whereas Histogram consumes more resources at query time. Conversely, Histogram consumes fewer resources on the client. Users should choose between the two according to their actual scenario.
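
To show roughly what the server-side calculation involves, here is a simplified sketch of estimating a quantile from cumulative buckets. It finds the bucket that contains the target rank and interpolates linearly inside it, which is the assumption histogram_quantile() is documented to make. The Python function below is an illustration, not Prometheus's code, and it takes the lower bound of the first bucket to be 0.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (le, cumulative_count) pairs sorted by le."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: fall back to the
                # highest finite bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Assume samples are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Buckets copied from the prometheus_tsdb_compaction_chunk_range output above.
chunk_range = [
    (100, 0), (400, 0), (1600, 0), (6400, 0), (25600, 0),
    (102400, 0), (409600, 0), (1.6384e6, 260), (6.5536e6, 780),
    (2.62144e7, 780), (math.inf, 780),
]
print(histogram_quantile(0.5, chunk_range))
```

The result is only an estimate whose accuracy depends on how the bucket boundaries were chosen, whereas a Summary reports quantiles that the client already computed from the raw observations.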

Origin www.cnblogs.com/pythonPath/p/11267262.html