Prometheus monitoring in practice

  1. Introduction to Prometheus

            Prometheus is an open-source monitoring and alerting system (with a built-in time series database, TSDB) that has been adopted by many companies and organizations since 2012. Its main features are as follows:

  • A multidimensional data model (a time series is identified by a metric name and a set of key/value labels)
  • A flexible query language (PromQL) for slicing data across dimensions
  • No reliance on distributed storage; single server nodes are autonomous
  • Time series collection over HTTP using a pull model
  • Pushing time series is supported through an intermediary gateway
  • Monitoring targets are found via service discovery or static configuration
  • Multiple modes of graphing and dashboard support

        The Prometheus ecosystem consists of multiple components, most of which work independently, so only the services that are actually needed have to be deployed, mainly:

  • The main Prometheus server, which scrapes and stores time series data
  • Client libraries for instrumenting application code or writing exporters (Go, Java, Python, Ruby)
  • A push gateway to support short-lived jobs
  • Visualization dashboards (two options, PromDash and Grafana; Grafana is the current mainstream choice)
  • An alert manager (Alertmanager) that handles alert aggregation, routing, silencing, and so on
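To make the pull model concrete, below is a minimal, stdlib-only Python sketch of a scrape target that serves one counter in the Prometheus text exposition format. Real applications should use an official client library instead; the metric name demo_requests_total is made up for illustration.

```python
# Minimal /metrics endpoint in the Prometheus text exposition format.
# This stdlib-only sketch just shows what a scrape target returns; real
# services should use an official client library (e.g. prometheus_client).
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS = {"count": 0}  # toy counter, incremented on every scrape
LOCK = threading.Lock()

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        with LOCK:
            REQUESTS["count"] += 1
            body = (
                "# HELP demo_requests_total Total scrapes served.\n"
                "# TYPE demo_requests_total counter\n"
                f'demo_requests_total{{handler="/metrics"}} {REQUESTS["count"]}\n'
            ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):  # silence per-request logging
        pass

def serve(port=0):
    """Start the exporter in a daemon thread; returns the bound port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server.server_address[1]
```

Pointing a targets entry in prometheus.yml at this address would let Prometheus scrape the counter.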

      

     2. Simple Prometheus deployment (Linux / CentOS)

           First, download the appropriate package from the official website ( https://prometheus.io/download/ ) and extract it on the Linux host with the following commands:

tar xvfz prometheus-*.tar.gz
cd prometheus-*

           The prometheus directory contains a prometheus.yml file, which is the main configuration file for the entire Prometheus service. The default file already covers most of the standard configuration:

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).


# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ['localhost:9090']

        The job_name is the name of the monitored object, and each job_name must be unique. The targets parameter under static_configs is critical: it determines the addresses of the monitored services. From the prometheus directory, the following commands can be used to start and stop the service.

# start the prometheus service in the background
nohup ./prometheus --config.file=prometheus.yml &

# find the running prometheus process
ps -ef | grep prometheus

# stop the prometheus service (plain kill sends SIGTERM for a clean shutdown; avoid kill -9)
kill {prometheus-pid}

        After starting the service, the monitoring status can be viewed at http://localhost:9090 on the virtual machine, or remotely through the virtual machine's mapped address.
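Besides the web UI, the same data is exposed over Prometheus's HTTP API (GET /api/v1/query for instant queries). A small stdlib-only Python helper, assuming a server at localhost:9090, might look like this:

```python
# Run instant PromQL queries against Prometheus's HTTP API.
import json
import urllib.parse
import urllib.request

def build_query_url(expr, base_url="http://localhost:9090"):
    """Build the URL for an instant query against /api/v1/query."""
    return base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})

def query_prometheus(expr, base_url="http://localhost:9090"):
    """Run an instant PromQL query and return the decoded JSON response."""
    with urllib.request.urlopen(build_query_url(expr, base_url)) as resp:
        return json.load(resp)
```

With a running server, query_prometheus('up') would return a JSON object whose "data" field holds the query result.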

     3. Common PromQL syntax in Prometheus

        In the query box on that page, the collected data can be processed and displayed with PromQL statements, so commonly used PromQL syntax is recorded below.

    Common operators:

+, -, *, /, %, ^ (addition, subtraction, multiplication, division, modulo, exponentiation)
==, !=, >, <, >=, <= (equal, not equal, greater than, less than, greater than or equal, less than or equal)

    Common aggregation functions:

sum (sum), min (minimum), max (maximum), avg (average), count (count)
stddev (standard deviation), stdvar (variance), count_values (count of each distinct value), bottomk (bottom k elements), topk (top k elements)

    Specific usage:

Query the metric http_requests_total, filtered by the job and handler labels:
http_requests_total{job="apiserver", handler="/api/comments"}

The same selection, over the last 5 minutes:
http_requests_total{job="apiserver", handler="/api/comments"}[5m]

Match series whose job label ends with "server":
http_requests_total{job=~".*server"}

Match series whose status label is not 4xx:
http_requests_total{status!~"4.."}

Per-second rate of http_requests_total over the last 5 minutes:
rate(http_requests_total[5m])

Per-second rate grouped by job:
sum(rate(http_requests_total[5m])) by (job)

Unused memory per instance (in MB):
(instance_memory_limit_bytes - instance_memory_usage_bytes) / 1024 / 1024
Unused memory grouped by app and proc (in MB):
sum( instance_memory_limit_bytes - instance_memory_usage_bytes) by (app, proc) / 1024 / 1024

Given data such as:
instance_cpu_time_ns{app="lion", proc="web", rev="34d0f99", env="prod", job="cluster-manager"}
instance_cpu_time_ns{app="elephant", proc="worker", rev="34d0f99", env="prod", job="cluster-manager"}
instance_cpu_time_ns{app="turtle", proc="api", rev="4d3a513", env="prod", job="cluster-manager"}
instance_cpu_time_ns{app="fox", proc="widget", rev="4d3a513", env="prod", job="cluster-manager"}

Top three by CPU time spent, grouped by app and proc:
topk(3, sum(rate(instance_cpu_time_ns[5m])) by (app, proc))

Number of series grouped by app:
count(instance_cpu_time_ns) by (app)

Average HTTP response time:
rate(basename_sum[5m]) / rate(basename_count[5m])
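The last expression assumes a summary or histogram metric pair (basename_sum / basename_count). For histogram metrics, a quantile can also be estimated with histogram_quantile; the metric name below is illustrative:

```
# estimated 95th-percentile request duration over the last 5 minutes
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```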

        Practice notes on using Prometheus together with other components and on monitoring various services will continue to be published. If there are any shortcomings or errors in this article, please feel free to point them out in the comments.

 

 

 
