Prometheus service alert rules, dashboard configuration, and Grafana user creation (4)

Continuing from the previous article in this monitoring series, Prometheus service monitoring (3), this article covers the alert rules and dashboard configuration for services monitored by Prometheus, and how to create users in Grafana.

Service alert rule templates

1. MySQL alert rule template

cd /home/monitor/prometheus  && mkdir rules
cd rules
cat mysql_status.yml
groups:
- name: MySQL_Monitor
  rules:
  - alert: MySQL is down
    expr: mysql_up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} MySQL is down"
      description: "MySQL database is down. This requires immediate action!"
  - alert: open files high
    expr: mysql_global_status_innodb_num_open_files > (mysql_global_variables_open_files_limit) * 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} open files high"
      description: "Open files is high. Please consider increasing open_files_limit."
  - alert: Mysql_High_QPS
    expr: rate(mysql_global_status_questions[5m]) > 500
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }}: Mysql_High_QPS detected"
      description: "{{ $labels.instance }}: MySQL operations are more than 500 per second (current value: {{ $value }})"
  - alert: Mysql_Too_Many_Connections
    expr: rate(mysql_global_status_threads_connected[5m]) > 200
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }}: Mysql Too Many Connections detected"
      description: "{{ $labels.instance }}: MySQL connections are more than 200 per second (current value: {{ $value }})"
  - alert: Mysql_Too_Many_slow_queries
    expr: rate(mysql_global_status_slow_queries[5m]) > 3
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }}: Mysql_Too_Many_slow_queries detected"
      description: "{{ $labels.instance }}: MySQL slow queries are more than 3 per second (current value: {{ $value }})"
  - alert: Read buffer size is bigger than max. allowed packet size
    expr: mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Read buffer size is bigger than max. allowed packet size"
      description: "Read buffer size (read_buffer_size) is bigger than max. allowed packet size (max_allowed_packet). This can break your replication."
  - alert: Sort buffer possibly misconfigured
    expr: mysql_global_variables_innodb_sort_buffer_size < 256*1024 or mysql_global_variables_read_buffer_size > 4*1024*1024
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Sort buffer possibly misconfigured"
      description: "Sort buffer size is either too big or too small. A good value for sort_buffer_size is between 256k and 4M."
  - alert: Thread stack size is too small
    expr: mysql_global_variables_thread_stack < 196608
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Thread stack size is too small"
      description: "Thread stack size is too small. This can cause problems when you use stored-language constructs, for example. A typical value for thread_stack_size is 256k."
  - alert: Used more than 80% of max connections limit
    expr: mysql_global_status_max_used_connections > mysql_global_variables_max_connections * 0.8
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} used more than 80% of the max connections limit"
      description: "Used more than 80% of the max connections limit"
  - alert: InnoDB Force Recovery is enabled
    expr: mysql_global_variables_innodb_force_recovery != 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Force Recovery is enabled"
      description: "InnoDB Force Recovery is enabled. This mode should be used for data recovery purposes only. It prohibits writing to the data."
  - alert: InnoDB Log File size is too small
    expr: mysql_global_variables_innodb_log_file_size < 16777216
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Log File size is too small"
      description: "The InnoDB Log File size is possibly too small. Choosing a small InnoDB Log File size can have significant performance impacts."
  - alert: InnoDB Flush Log at Transaction Commit
    expr: mysql_global_variables_innodb_flush_log_at_trx_commit != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Flush Log at Transaction Commit"
      description: "InnoDB Flush Log at Transaction Commit is set to a value != 1. This can lead to a loss of committed transactions in case of a power failure."
  - alert: Table definition cache too small
    expr: mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Table definition cache too small"
      description: "Your Table Definition Cache is possibly too small. If it is much too small this can have significant performance impacts!"
  - alert: Table open cache too small
    expr: mysql_global_status_open_tables > mysql_global_variables_table_open_cache * 99/100
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Table open cache too small"
      description: "Your Table Open Cache is possibly too small (old name: Table Cache). If it is much too small this can have significant performance impacts!"
  - alert: Thread stack size is possibly too small
    expr: mysql_global_variables_thread_stack < 262144
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Thread stack size is possibly too small"
      description: "Thread stack size is possibly too small. This can cause problems when you use stored-language constructs, for example. A typical value for thread_stack_size is 256k."
  - alert: InnoDB Buffer Pool Instances is too small
    expr: mysql_global_variables_innodb_buffer_pool_instances == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Buffer Pool Instances is too small"
      description: "If you are using MySQL 5.5 or higher you should use several InnoDB Buffer Pool Instances for performance reasons. Rules of thumb: each InnoDB Buffer Pool Instance should be at least 1 GiB in size, and the number of instances can be set equal to the number of cores of your machine."
  - alert: InnoDB Plugin is enabled
    expr: mysql_global_variables_ignore_builtin_innodb == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Plugin is enabled"
      description: "InnoDB Plugin is enabled"
  - alert: Binary Log is disabled
    expr: mysql_global_variables_log_bin != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Binary Log is disabled"
      description: "Binary Log is disabled. This prevents you from doing Point-in-Time Recovery (PiTR)."
  - alert: Binlog Cache size too small
    expr: mysql_global_variables_binlog_cache_size < 1048576
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Binlog Cache size too small"
      description: "Binlog Cache size is possibly too small. A value of 1 MiB or higher is OK."
  - alert: Binlog Statement Cache size too small
    expr: mysql_global_variables_binlog_stmt_cache_size < 1048576 and mysql_global_variables_binlog_stmt_cache_size > 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Binlog Statement Cache size too small"
      description: "Binlog Statement Cache size is possibly too small. A value of 1 MiB or higher is typically OK."
  - alert: Binlog Transaction Cache size too small
    expr: mysql_global_variables_binlog_cache_size < 1048576
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Binlog Transaction Cache size too small"
      description: "Binlog Transaction Cache size is possibly too small. A value of 1 MiB or higher is typically OK."
  - alert: Sync Binlog is enabled
    expr: mysql_global_variables_sync_binlog == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Sync Binlog is enabled"
      description: "Sync Binlog is enabled. This leads to higher data security, but at the cost of write performance."
  - alert: IO thread stopped
    expr: mysql_slave_status_slave_io_running != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} IO thread stopped"
      description: "IO thread has stopped. This is usually because it cannot connect to the master any more."
  - alert: SQL thread stopped
    expr: mysql_slave_status_slave_sql_running == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} SQL thread stopped"
      description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
  - alert: Slave lagging behind Master
    expr: rate(mysql_slave_status_seconds_behind_master[5m]) > 30
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Slave lagging behind Master"
      description: "Slave is lagging behind Master. Please check whether the slave threads are running and whether there are performance issues!"
  - alert: Slave is NOT read only (you can ignore this warning indicator)
    expr: mysql_global_variables_read_only != 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Slave is NOT read only"
      description: "Slave is NOT set to read only. You can accidentally manipulate data on the slave and get inconsistencies..."
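For the rule file to take effect, it has to be referenced from the main Prometheus configuration. A minimal sketch of the relevant prometheus.yml fragment, assuming the layout used in this series (/home/monitor/prometheus):

```yaml
# Fragment of /home/monitor/prometheus/prometheus.yml (paths are assumptions from this setup)
rule_files:
  - "rules/mysql_status.yml"
```

After editing, restart Prometheus (or trigger a configuration reload) and the rules appear under the Alerts tab of the web UI.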

2. ES alert rule template

cd /home/monitor/prometheus/rules
cat es.yml
groups:
- name: elasticsearchStatsAlert
  rules:
  - alert: Elastic_Cluster_Health_RED
    expr: elasticsearch_cluster_health_status{color="red"} == 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }}: not all primary and replica shards are allocated in elasticsearch cluster {{ $labels.cluster }}"
      description: "Instance {{ $labels.instance }}: not all primary and replica shards are allocated in elasticsearch cluster {{ $labels.cluster }}."

  - alert: Elastic_Cluster_Health_Yellow
    expr: elasticsearch_cluster_health_status{color="yellow"} == 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }}: not all primary and replica shards are allocated in elasticsearch cluster {{ $labels.cluster }}"
      description: "Instance {{ $labels.instance }}: not all primary and replica shards are allocated in elasticsearch cluster {{ $labels.cluster }}."

  - alert: Elasticsearch_JVM_Heap_Too_High
    expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.8
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "ElasticSearch node {{ $labels.instance }} heap usage is high"
      description: "The heap usage on {{ $labels.instance }} has been over 80% for 1m."

  - alert: Elasticsearch_health_up
    expr: elasticsearch_cluster_health_up != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "ElasticSearch node {{ $labels.instance }}: last scrape of the ElasticSearch cluster health failed"
      description: "ElasticSearch node {{ $labels.instance }}: last scrape of the ElasticSearch cluster health failed"

  - alert: Elasticsearch_Too_Few_Nodes_Running
    expr: elasticsearch_cluster_health_number_of_nodes < 10
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "There are only {{ $value }} < 10 ElasticSearch nodes running"
      description: "ElasticSearch is running on less than 10 nodes (total 10)"

  - alert: Elasticsearch_Count_of_JVM_GC_Runs
    expr: rate(elasticsearch_jvm_gc_collection_seconds_count[5m]) > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "ElasticSearch node {{ $labels.instance }}: count of JVM GC runs > 5 per sec and has a value of {{ $value }}"
      description: "ElasticSearch node {{ $labels.instance }}: count of JVM GC runs > 5 per sec and has a value of {{ $value }}"

  - alert: Elasticsearch_GC_Run_Time
    expr: rate(elasticsearch_jvm_gc_collection_seconds_sum[5m]) > 0.3
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "ElasticSearch node {{ $labels.instance }}: GC run time in seconds > 0.3 sec and has a value of {{ $value }}"
      description: "ElasticSearch node {{ $labels.instance }}: GC run time in seconds > 0.3 sec and has a value of {{ $value }}"

#  - alert: Elasticsearch_json_parse_failures
#    expr: elasticsearch_cluster_health_json_parse_failures > 0
#    for: 1m
#    labels:
#      severity: critical
#    annotations:
#      summary: "ElasticSearch node {{ $labels.instance }}: json parse failures > 0 and has a value of {{ $value }}"
#      description: "ElasticSearch node {{ $labels.instance }}: json parse failures > 0 and has a value of {{ $value }}"

  - alert: Elasticsearch_breakers_tripped
    expr: rate(elasticsearch_breakers_tripped[5m]) > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "ElasticSearch node {{ $labels.instance }}: breakers tripped > 0 and has a value of {{ $value }}"
      description: "ElasticSearch node {{ $labels.instance }}: breakers tripped > 0 and has a value of {{ $value }}"

  - alert: Elasticsearch_health_timed_out
    expr: elasticsearch_cluster_health_timed_out > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "ElasticSearch node {{ $labels.instance }}: number of cluster health checks timed out > 0 and has a value of {{ $value }}"
      description: "ElasticSearch node {{ $labels.instance }}: number of cluster health checks timed out > 0 and has a value of {{ $value }}"

3. Node alert rule templates (these can also be merged into one file)

1) CPU template

cat cpu.yml
groups:
- name: cpu.rules
  rules:
  # Alert for any instance whose CPU usage is too high for more than 1 minute.
  - alert: NodeCpuUsage
    expr: 100 - irate(node_cpu_seconds_total{job="node",mode="idle"}[5m]) * 100 > 1
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "{{ $labels.instance }} cpu usage load too high"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been too high for more than 1 minute."

2) Filesystem template

cat file_sys.yml
groups:
- name: file_sys.rules
  rules:
  - alert: NodeFilesystemUsage
    expr: (node_filesystem_size{device="rootfs"} - node_filesystem_free{device="rootfs"}) / node_filesystem_size{device="rootfs"} * 100 > 80
    for: 2m
    labels:
      severity: error
    annotations:
      summary: "{{ $labels.instance }}: High Filesystem usage detected"
      description: "{{ $labels.instance }}: Filesystem usage is above 80% (current value is: {{ $value }})"

3) Memory template

cat memory.yml
groups:
- name: mem.rules
  rules:
  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} memory usage is too high"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has had memory usage above 90% for more than 5 minutes."

4) Node survival template

cat node_up.yml
groups:
- name: general.rules
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

5) After merging

groups:
- name: Host_Monitor
  rules:
  # Alert for any instance whose CPU usage is too high for more than 1 minute.
  - alert: NodeCpuUsage
    expr: 100 - irate(node_cpu_seconds_total{job="node",mode="idle"}[5m]) * 100 > 1
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "{{ $labels.instance }} cpu usage load too high"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been too high for more than 1 minute."

  - alert: NodeFilesystemUsage
    expr: (node_filesystem_size{device="rootfs"} - node_filesystem_free{device="rootfs"}) / node_filesystem_size{device="rootfs"} * 100 > 80
    for: 2m
    labels:
      severity: error
    annotations:
      summary: "{{ $labels.instance }}: High Filesystem usage detected"
      description: "{{ $labels.instance }}: Filesystem usage is above 80% (current value is: {{ $value }})"

  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} memory usage is too high"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has had memory usage above 90% for more than 5 minutes."

  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
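With several rule files in place, they can all be referenced from prometheus.yml with a glob instead of listing each file; a sketch, again assuming the /home/monitor/prometheus layout:

```yaml
# Fragment of prometheus.yml: load every rule file in the rules directory
rule_files:
  - "rules/*.yml"
```

It is worth validating the files with `promtool check rules rules/*.yml` before reloading, since a syntax error in any rule file will prevent Prometheus from starting.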

Dashboard configuration

1. Download and install the Grafana package

wget https://dl.grafana.com/oss/release/grafana-6.4.3-1.x86_64.rpm
yum localinstall -y grafana-6.4.3-1.x86_64.rpm
systemctl start grafana-server
systemctl enable grafana-server

2. Log in to Grafana and connect the data source

Log in to Grafana

Default port: 3000
URL: http://ip:3000
Initial admin account: admin, password: admin

Then add Prometheus as a data source on Grafana's Data Sources page.

3. Dashboard configuration

1) Node dashboard

Template links:

https://grafana.com/grafana/dashboards?search=node%20export
https://grafana.com/grafana/dashboards/8919

Search for a node template on grafana.com, copy the template's id, then paste the id into Grafana's import dialog to import the template and view the finished dashboard.

2) MySQL dashboard

https://grafana.com/grafana/dashboards?search=Mysql%20over
https://grafana.com/dashboards/7362


3) ES dashboard

https://grafana.com/grafana/dashboards/6483 (recommended)


https://grafana.com/grafana/dashboards/2322 (the variable expressions need adjusting)


label_values(elasticsearch_indices_docs{instance="$instance",cluster="$cluster", name!=""}, name)


4) Nginx dashboard

https://grafana.com/grafana/dashboards/2949


https://grafana.com/grafana/dashboards/2984


5) Tomcat dashboard

https://grafana.com/grafana/dashboards/8563 (provides a jmx-exporter configuration)

When importing the template, you need to enter the job name. You can look up the job_name in the Tomcat JSON target file (/home/monitor/prometheus/conf.d/tomcat_node.json), or view it in the Prometheus web UI configuration.
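For reference, a file_sd target file of the kind referred to above typically looks like the sketch below; the target address here is hypothetical, so check your own /home/monitor/prometheus/conf.d/tomcat_node.json for the actual job_name and targets:

```json
[
  {
    "targets": ["192.168.1.101:12345"],
    "labels": {
      "job": "tomcat"
    }
  }
]
```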

Grafana user creation

1. Create users by sending an email invite

When creating a user, Grafana warns that no mailbox (SMTP) is configured; you can ignore this. With the email method, the username simply becomes the email address; the invitation email is not actually delivered to your mailbox.
In the browser, press F12 to view the real backend link behind the invite. The link points to localhost; obviously there is no Grafana service on my local computer, so the address cannot be opened and the invited user has no way to set a password.
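Incidentally, the invite links are generated from Grafana's root_url setting, which defaults to localhost. As an alternative to the reverse-proxy approach described below, pointing root_url at the server's real address also fixes the generated links; a sketch of the grafana.ini [server] section, where the IP address is a placeholder for your own server:

```ini
# /etc/grafana/grafana.ini — [server] section (the address below is a placeholder)
[server]
domain = 192.168.1.100
root_url = http://192.168.1.100:3000/
```

Restart grafana-server after the change.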

2. Create users without email


3. Solve the problem that the invite address points to localhost by building an nginx reverse proxy

With both of the methods above, the final invite link cannot be opened. The reason is that Grafana is deployed on the server, not on my local computer. The solution is a reverse proxy: nginx forwards the request to the destination address I want to visit.

1) Download nginx

http://nginx.org/download/nginx-1.16.1.zip


2) Extract and configure
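A minimal sketch of the reverse-proxy server block for nginx.conf, assuming Grafana listens on the server at 192.168.1.100:3000 (a placeholder address):

```nginx
# nginx.conf: proxy the local port to the remote Grafana server (address is a placeholder)
server {
    listen       3000;
    server_name  localhost;

    location / {
        proxy_pass http://192.168.1.100:3000;
        proxy_set_header Host $host;
    }
}
```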

3) Start nginx

Double-click nginx.exe and check the listening port.

4) Log in via the invite link (paste the invite link address)

If the account was created without email, you can log in with the user name. If the account was created by email, you must use the mailbox as the user name to log in.


Origin blog.csdn.net/weixin_44729138/article/details/105764601