Prometheus Usage Notes

Using Prometheus

Environment

Command-line query examples

  • CPU usage calculation

    Total CPU time used in the period t1 to t2 = (user2 + nice2 + system2 + idle2 + iowait2 + irq2 + softirq2) - (user1 + nice1 + system1 + idle1 + iowait1 + irq1 + softirq1)
    Idle CPU time in the period t1 to t2 = (idle2 - idle1)

    CPU utilization over t1 to t2 = 1 - idle CPU time / total CPU time

    increase() function: returns the increment of a counter-type metric over a time period

    Multicore CPU calculation

    sum() sums the per-core results

    • Get the total CPU time
    • Get the idle time (mode="idle")
    • Get the total time
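As a sketch of the t1 → t2 formula above, the same calculation can be done in shell against two /proc/stat-style "cpu" lines (the counter values here are made up for illustration):

```shell
#!/bin/bash
# Two example /proc/stat "cpu" samples taken at t1 and t2
# (fields: user nice system idle iowait irq softirq):
t1="cpu  100  10   50     800   20     5   15"
t2="cpu  160  10   70     1400  30     5   25"

# utilization = 1 - (idle2 - idle1) / (total2 - total1)
util=$(printf '%s\n%s\n' "$t1" "$t2" | awk '
  NR==1 { for (i=2; i<=8; i++) s1 += $i; idle1 = $5 }
  NR==2 { for (i=2; i<=8; i++) s2 += $i; idle2 = $5 }
  END   { printf "%.2f", 1 - (idle2 - idle1) / (s2 - s1) }')
echo "CPU utilization: $util"
```

With these sample numbers the idle delta is 600 against a total delta of 700, so utilization comes out to about 0.14.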

  • The total CPU utilization on a single machine

    1-(sum(increase(node_cpu_seconds_total{instance="192.168.9.232:9100",mode="idle"}[1m]))/sum(increase(node_cpu_seconds_total{instance="192.168.9.232:9100"}[1m])))
    
  • by (instance): group results by instance

  • (1-( sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by(instance) / sum(increase(node_cpu_seconds_total{}[1m]) ) by(instance) )) * 100
    
  • Time-share calculation for the other CPU modes

    • iowait: I/O wait

      sum(increase(node_cpu_seconds_total{mode="iowait"}[1m])) by(instance) / sum(increase(node_cpu_seconds_total{}[1m]) ) by(instance)

    • irq: hardware interrupts

      sum(increase(node_cpu_seconds_total{mode="irq"}[1m])) by(instance) / sum(increase(node_cpu_seconds_total{}[1m]) ) by(instance)

    • softirq: software interrupts

      sum(increase(node_cpu_seconds_total{mode="softirq"}[1m])) by(instance) / sum(increase(node_cpu_seconds_total{}[1m]) ) by(instance)

    • steal: time stolen by the hypervisor for other virtual machines

      sum(increase(node_cpu_seconds_total{mode="steal"}[1m])) by(instance) / sum(increase(node_cpu_seconds_total{}[1m]) ) by(instance)

    • nice: time spent on processes with an adjusted nice value

      sum(increase(node_cpu_seconds_total{mode="nice"}[1m])) by(instance) / sum(increase(node_cpu_seconds_total{}[1m]) ) by(instance)

    • idle: idle time

      sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by(instance) / sum(increase(node_cpu_seconds_total{}[1m]) ) by(instance)

    • user: user mode

      sum(increase(node_cpu_seconds_total{mode="user"}[1m])) by(instance) / sum(increase(node_cpu_seconds_total{}[1m]) ) by(instance)

    • system: kernel mode

      sum(increase(node_cpu_seconds_total{mode="system"}[1m])) by(instance) / sum(increase(node_cpu_seconds_total{}[1m]) ) by(instance)
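Any of the per-mode expressions above can also be run outside the web UI through the Prometheus HTTP API; a sketch (the server address is an assumption for this environment):

```shell
#!/bin/bash
# Query the iowait time share via the HTTP API; replace the address
# with your own Prometheus server
PROM="http://192.168.9.232:9090"
QUERY='sum(increase(node_cpu_seconds_total{mode="iowait"}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance)'

# --data-urlencode takes care of escaping the expression:
# curl -s "$PROM/api/v1/query" --data-urlencode "query=$QUERY"
echo "would query: $PROM/api/v1/query"
```

The API returns JSON with one sample per `instance` label, the same series the graph view shows.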

Extended query usage

  • Filtering

    • Label filtering: key{label=""}
      • Regex (fuzzy) matching: key{label=~"web.*"}
    • Value filtering
      • Comparison: key{.} > 400
  • function

    • rate(.[5m]): for counter-type data, takes the average per-second increment over the given time window

      value = ΔS / Δt

      • The chosen window should account for the program's data-collection (scrape) interval
    • increase(.[5m]): for counter-type data, takes the total increment over the time window

      value = ΔS

    • sum(): sums results

      • Often combined with by()
    • topk(x, key): takes the top x values

      • Not suitable for graphs; better viewed in the console
      • Suitable for instantaneous alerting
    • count()

      • Useful for fuzzy monitoring checks, e.g. counting how many instances match a condition
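The rate()/increase() relationship above reduces to simple arithmetic; a toy illustration on counter samples taken 60 s apart (the values are made up):

```shell
#!/bin/bash
# Counter values at t1 and t2, and the window length in seconds
s1=1000; s2=1300; dt=60

increase=$((s2 - s1))   # increase(): total increment over the window, ΔS
rate=$(awk -v d="$increase" -v t="$dt" 'BEGIN { printf "%.1f", d / t }')  # rate(): ΔS/Δt
echo "increase=$increase rate=$rate/s"
```

So a counter that grew by 300 over a one-minute window reports increase() = 300 and rate() = 5.0 per second.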

Data collection

Start the server - for production

  • Reloading the Prometheus configuration file

    • Send a signal to the prometheus process
      • kill -HUP pid
    • Send an HTTP request to prometheus
      • curl -XPOST http://prometheus.chenlei.com/-/reload
  • Running as a background process

    • Use the screen tool

    • Use daemonize

      > yum install -y kernel-devel 
      > yum groupinstall -y "Development tools"
      > git clone https://github.com/bmc/daemonize.git
      > cd daemonize
      > ./configure && make && make install 
      
  • Additional prometheus startup parameters

    • --web.listen-address: listen address, e.g. 0.0.0.0:9090
    • --web.read-timeout: maximum wait time for a request connection, e.g. 2m
    • --web.max-connections: maximum number of connections, e.g. 10
    • --storage.tsdb.retention: data retention period, e.g. 90d
    • --storage.tsdb.path: data storage path, e.g. /data/prometheus/server/data
    • --query.max-concurrency: maximum number of concurrent queries, e.g. 20
    • --query.timeout: query timeout, e.g. 2m
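Putting the parameters together, a full startup command might look like the sketch below (paths, the Prometheus 2.x double-dash flag syntax, and the daemonize wrapper are assumptions for this environment):

```shell
#!/bin/bash
# Assemble the startup command with the flags listed above;
# adjust the binary and data paths to your installation
PROM_CMD='/usr/local/prometheus/prometheus
  --web.listen-address=0.0.0.0:9090
  --web.read-timeout=2m
  --web.max-connections=10
  --storage.tsdb.retention=90d
  --storage.tsdb.path=/data/prometheus/server/data
  --query.max-concurrency=20
  --query.timeout=2m'
echo "$PROM_CMD"
# Run in the background via daemonize:
# daemonize -c /data/prometheus $PROM_CMD
```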
  • Storage structure

    server/
    └── data
        ├── 01DM9HP1PHHK2BD1MGC7J1C0YC
        │   ├── chunks
        │   │   └── 000001
        │   ├── index
        │   ├── meta.json
        │   └── tombstones
        ├── 01DM9ZDG8QKWTPYZ86K7XW6FKZ
        │   ├── chunks
        │   │   └── 000001
        │   ├── index
        │   ├── meta.json
        │   └── tombstones
        ├── 01DMAM0NM51YSQ4EVRRV46X2E1
        │   ├── chunks
        │   │   └── 000001
        │   ├── index
        │   ├── meta.json
        │   └── tombstones
        ├── 01DMAM0P4CGJWSSA15QPWJGZXF
        │   ├── chunks
        │   │   └── 000001
        │   ├── index
        │   ├── meta.json
        │   └── tombstones
        ├── lock
        ├── queries.active
        └── wal
            ├── 00000011
            ├── 00000012
            ├── 00000013
            ├── 00000014
            ├── 00000015
            ├── 00000016
            ├── 00000017
            ├── 00000018
            └── checkpoint.000010
                └── 00000000
    
  • Recent data is kept in the wal/ directory so that in-memory data can be recovered after a sudden power failure or restart

Server configuration file written

global:
  scrape_interval:     5s # scrape frequency
  evaluation_interval: 1s 



alerting:
  alertmanagers:
  - static_configs:
    - targets:



rule_files:

scrape_configs:	

  - job_name: 'prometheus'

    static_configs:
    - targets: ['localhost:9090']

  - job_name: '233-node-exporter'

    static_configs:
    - targets: ['192.168.9.233:9100']

  - job_name: '232-node-exporter'

    static_configs:
    - targets: ['192.168.9.232:9100']

  - job_name: '239-node-exporter'

    static_configs:
    - targets: ['192.168.9.239:9200']
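Before reloading a config like the one above, it can be validated with promtool (shipped in the prometheus release tarball); a sketch using a temporary file:

```shell
#!/bin/bash
# Write a minimal scrape config to a temp file and validate it
# before pointing the real server at it
cat > /tmp/prometheus-test.yml <<'EOF'
global:
  scrape_interval: 5s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
EOF
# promtool check config /tmp/prometheus-test.yml   # reports success/failure per file
echo "wrote /tmp/prometheus-test.yml"
```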

node_exporter

github address

  • Metrics collector for servers
  • The default collectors cover most needs
  • Individual collectors can be enabled or disabled at startup

pushgateway

  • Introduction

    Actively pushes data to the prometheus server

    Can run on a separate node; it does not have to run on the monitored node itself

  • installation

  • Collecting with custom scripts pushed to pushgateway

    • Install pushgateway

    • Configure the pushgateway-related job in prometheus

    • Write the data-collection script on the target host

    • Periodically run the script to send metric data to pushgateway

      #!/bin/bash
      instance_name=instance_name
      
      label=label
      value=123
      
      echo "$label $value" | curl --data-binary @- http://192.168.9.233:9091/metrics/job/test/instance/$instance_name
      
  • Drawbacks

    • Single-point bottleneck
    • No data filtering
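The single-metric echo-to-curl pattern shown above also accepts several metrics, plus # TYPE hints, in one request; a sketch of the exposition format (the address, job, and metric names are examples):

```shell
#!/bin/bash
# Multiple metrics with type hints, pushed in a single request
payload='# TYPE disk_usage_percent gauge
disk_usage_percent{path="/"} 42
# TYPE open_sessions gauge
open_sessions 7'

echo "$payload"
# echo "$payload" | curl --data-binary @- \
#   http://192.168.9.233:9091/metrics/job/test/instance/node1
```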

Custom exporter

Interface visualization

grafana

  • Introduction

    An open-source data visualization tool

  • installation

  • Configuration

    • Add a prometheus data source

    • Add a dashboard

    • Create a Dashboard

      • Configure the data source
  • Graph configuration

    • Visualization
    • Axes
    • Legend
    • Thresholds & Time Regions
    • Data links
  • Panel layout

  • Alert configuration

  • Backup

    • Export json
    • save as
  • Restore

    • Import json / paste json
  • Alerting
    Alerting is a new feature introduced in Grafana 4.0

    • DingTalk alerts
    • PagerDuty

practice

  • Memory usage

    • Source
      node_exporter
    • Formula
      value = available / total
      Actual available memory = free + buffers + cached
    • Formula implementation
      ((node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes)/node_memory_MemTotal_bytes)*100
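A quick sanity check of this formula in shell, using made-up /proc/meminfo-style values (in kB):

```shell
#!/bin/bash
# Example values; on a real host these come from /proc/meminfo
MemTotal=8000000; MemFree=1000000; Buffers=500000; Cached=1500000

# available percentage = (free + buffers + cached) / total * 100
pct=$(awk -v t="$MemTotal" -v f="$MemFree" -v b="$Buffers" -v c="$Cached" \
  'BEGIN { printf "%.1f", (f + b + c) / t * 100 }')
echo "available: $pct%"
```

With these numbers, 3,000,000 of 8,000,000 kB is available, i.e. 37.5%.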
  • Monitoring disk I/O

    • Source
      node_exporter
    • Formula
      value = read speed + write speed
    • Formula implementation
      Related function: predict_linear(), for predicting trends
      (rate(node_disk_read_bytes_total[1m])+rate(node_disk_written_bytes_total[1m]))
  • Network monitoring

    • Data source
      bash script + pushgateway

    • Writing the script
      Collects ping latency and packet-loss rate

      instance=`hostname -f`
      # external connectivity: packet-loss percentage
      lostpk=`timeout 5 ping -q -A -s 500 -W 1000 -c 100 baidu.com | grep transmitted | awk '{print $6}'`
      # total time
      rrt=`timeout 5 ping -q -A -s 500 -W 1000 -c 100 baidu.com | grep transmitted | awk '{print $10}'`
      
      # metric values must be numeric: strip the % and ms suffixes
      value_lostpk=${lostpk%%\%}
      value_rrt=${rrt%%ms}
      
      # send to prometheus via pushgateway
      echo "lostpk_$instance : $value_lostpk"
      echo "lostpk_$instance $value_lostpk" | curl --data-binary @- http://192.168.9.233:9091/metrics/job/network-traffic/instance/$instance
      
      echo "rrt_$instance : $value_rrt"
      echo "rrt_$instance $value_rrt" | curl --data-binary @- http://192.168.9.233:9091/metrics/job/network-traffic/instance/$instance
      
      
    • Scheduled execution
      Steps:

      • Install crontab
      • Add an entry in /etc/crontab so cron runs the collection script
    • Checking the results

      • In Prometheus, check under Targets whether the new target is online; if not, add it to the prometheus configuration and remember to reload

      • Check the configuration

      • Check the metrics: typing the custom key in the query input should bring up suggestions for lostpk and rrt
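The scheduling step above amounts to one crontab line; a sketch that stages the entry in a temp file (the script path is an assumption — in production the line goes into /etc/crontab):

```shell
#!/bin/bash
# Run the collection script every minute so pushgateway always holds
# a fresh sample; /etc/crontab format includes the user field (root)
echo '* * * * * root /usr/local/scripts/network_traffic.sh >/dev/null 2>&1' \
  > /tmp/crontab-example
cat /tmp/crontab-example
```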


Origin blog.csdn.net/qq_37933685/article/details/100767261