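Prometheus Service Alert Rules and Dashboard Configuration for the Monitoring Tool (4)
This article follows the earlier "Monitoring Tool: Prometheus Service Monitoring (3)". It mainly covers configuring service alert rules in Prometheus, configuring dashboards, and creating users in grafana.
Service alert rule templates
1. MySQL alert rule template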
cd /home/monitor/prometheus && mkdir rules
cd rules
cat mysql_status.yml
groups:
- name: MySQL_Monitor
  rules:
  - alert: MySQL is down
    expr: mysql_up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} MySQL is down"
      description: "MySQL database is down. This requires immediate action!"
  - alert: open files high
    expr: mysql_global_status_innodb_num_open_files > (mysql_global_variables_open_files_limit) * 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} open files high"
      description: "Open files is high. Please consider increasing open_files_limit."
  - alert: Mysql_High_QPS
    expr: rate(mysql_global_status_questions[5m]) > 500
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }}: Mysql_High_QPS detected"
      description: "{{ $labels.instance }}: MySQL operations exceed 500 per second (current value is: {{ $value }})"
  - alert: Mysql_Too_Many_Connections
    # threads_connected is a gauge, so it is compared directly instead of through rate().
    expr: mysql_global_status_threads_connected > 200
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }}: Mysql Too Many Connections detected"
      description: "{{ $labels.instance }}: MySQL has more than 200 connections (current value is: {{ $value }})"
  - alert: Mysql_Too_Many_slow_queries
    expr: rate(mysql_global_status_slow_queries[5m]) > 3
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }}: Mysql_Too_Many_slow_queries detected"
      description: "{{ $labels.instance }}: MySQL slow queries exceed 3 per second (current value is: {{ $value }})"
  - alert: Read buffer size is bigger than max. allowed packet size
    expr: mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Read buffer size is bigger than max. allowed packet size"
      description: "Read buffer size (read_buffer_size) is bigger than max. allowed packet size (max_allowed_packet). This can break your replication."
  - alert: Sort buffer possibly misconfigured
    expr: mysql_global_variables_innodb_sort_buffer_size < 256*1024 or mysql_global_variables_read_buffer_size > 4*1024*1024
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Sort buffer possibly misconfigured"
      description: "Sort buffer size is either too big or too small. A good value for sort_buffer_size is between 256k and 4M."
  - alert: Thread stack size is too small
    expr: mysql_global_variables_thread_stack < 196608
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Thread stack size is too small"
      description: "Thread stack size is too small. This can cause problems when you use Stored Language constructs, for example. A typical value for thread_stack is 256k."
  - alert: Used more than 80% of max connections limited
    expr: mysql_global_status_max_used_connections > mysql_global_variables_max_connections * 0.8
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Used more than 80% of max connections limited"
      description: "Used more than 80% of max connections limited"
  - alert: InnoDB Force Recovery is enabled
    expr: mysql_global_variables_innodb_force_recovery != 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Force Recovery is enabled"
      description: "InnoDB Force Recovery is enabled. This mode should be used for data recovery purposes only. It prohibits writing to the data."
  - alert: InnoDB Log File size is too small
    expr: mysql_global_variables_innodb_log_file_size < 16777216
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Log File size is too small"
      description: "The InnoDB Log File size is possibly too small. Choosing a small InnoDB Log File size can have significant performance impacts."
  - alert: InnoDB Flush Log at Transaction Commit
    expr: mysql_global_variables_innodb_flush_log_at_trx_commit != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Flush Log at Transaction Commit"
      description: "InnoDB Flush Log at Transaction Commit is set to a value != 1. This can lead to a loss of committed transactions in case of a power failure."
  - alert: Table definition cache too small
    expr: mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Table definition cache too small"
      description: "Your Table Definition Cache is possibly too small. If it is much too small this can have significant performance impacts!"
  - alert: Table open cache too small
    expr: mysql_global_status_open_tables > mysql_global_variables_table_open_cache * 99/100
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Table open cache too small"
      description: "Your Table Open Cache is possibly too small (old name Table Cache). If it is much too small this can have significant performance impacts!"
  - alert: Thread stack size is possibly too small
    expr: mysql_global_variables_thread_stack < 262144
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Thread stack size is possibly too small"
      description: "Thread stack size is possibly too small. This can cause problems when you use Stored Language constructs, for example. A typical value for thread_stack is 256k."
  - alert: InnoDB Buffer Pool Instances is too small
    expr: mysql_global_variables_innodb_buffer_pool_instances == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Buffer Pool Instances is too small"
      description: "If you are using MySQL 5.5 or higher, you should use several InnoDB Buffer Pool Instances for performance reasons. Rules of thumb: each InnoDB Buffer Pool Instance should be at least 1 Gbyte in size, and the number of instances can be set equal to the number of cores of your machine."
  - alert: InnoDB Plugin is enabled
    expr: mysql_global_variables_ignore_builtin_innodb == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Plugin is enabled"
      description: "InnoDB Plugin is enabled"
  - alert: Binary Log is disabled
    expr: mysql_global_variables_log_bin != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Binary Log is disabled"
      description: "Binary Log is disabled. This prevents Point in Time Recovery (PiTR)."
  - alert: Binlog Cache size too small
    expr: mysql_global_variables_binlog_cache_size < 1048576
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Binlog Cache size too small"
      description: "Binlog Cache size is possibly too small. A value of 1 Mbyte or higher is OK."
  - alert: Binlog Statement Cache size too small
    expr: mysql_global_variables_binlog_stmt_cache_size < 1048576 and mysql_global_variables_binlog_stmt_cache_size > 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Binlog Statement Cache size too small"
      description: "Binlog Statement Cache size is possibly too small. A value of 1 Mbyte or higher is typically OK."
  - alert: Binlog Transaction Cache size too small
    expr: mysql_global_variables_binlog_cache_size < 1048576
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Binlog Transaction Cache size too small"
      description: "Binlog Transaction Cache size is possibly too small. A value of 1 Mbyte or higher is typically OK."
  - alert: Sync Binlog is enabled
    expr: mysql_global_variables_sync_binlog == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Sync Binlog is enabled"
      description: "Sync Binlog is enabled. This leads to higher data security but at the cost of write performance."
  - alert: IO thread stopped
    expr: mysql_slave_status_slave_io_running != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} IO thread stopped"
      description: "IO thread has stopped. This is usually because it cannot connect to the Master any more."
  - alert: SQL thread stopped
    expr: mysql_slave_status_slave_sql_running == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} SQL thread stopped"
      description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
  - alert: Slave lagging behind Master
    # seconds_behind_master is a gauge, so it is compared directly instead of through rate().
    expr: mysql_slave_status_seconds_behind_master > 30
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Slave lagging behind Master"
      description: "Slave is lagging behind Master. Please check if Slave threads are running and if there are some performance issues!"
  - alert: Slave is NOT read only(Please ignore this warning indicator.)
    # Fires when read_only is 0. A master is normally writable, so scope this rule to slave instances.
    expr: mysql_global_variables_read_only == 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Slave is NOT read only"
      description: "Slave is NOT set to read only. You can accidentally manipulate data on the slave and get inconsistencies..."
2. es (Elasticsearch) alert rule template
cd /home/monitor/prometheus/rules
cat es.yml
groups:
- name: elasticsearchStatsAlert
  rules:
  - alert: Elastic_Cluster_Health_RED
    expr: elasticsearch_cluster_health_status{color="red"} == 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }}: one or more primary shards are not allocated in elasticsearch cluster {{ $labels.cluster }}"
      description: "Instance {{ $labels.instance }}: one or more primary shards are not allocated in elasticsearch cluster {{ $labels.cluster }}."
  - alert: Elastic_Cluster_Health_Yellow
    expr: elasticsearch_cluster_health_status{color="yellow"} == 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }}: all primary shards are allocated, but not all replica shards, in elasticsearch cluster {{ $labels.cluster }}"
      description: "Instance {{ $labels.instance }}: all primary shards are allocated, but not all replica shards, in elasticsearch cluster {{ $labels.cluster }}."
  - alert: Elasticsearch_JVM_Heap_Too_High
    expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.8
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "ElasticSearch node {{ $labels.instance }} heap usage is high"
      description: "The heap in {{ $labels.instance }} has been over 80% for 1m."
  - alert: Elasticsearch_health_up
    expr: elasticsearch_cluster_health_up != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "ElasticSearch node: {{ $labels.instance }} last scrape of the ElasticSearch cluster health failed"
      description: "ElasticSearch node: {{ $labels.instance }} last scrape of the ElasticSearch cluster health failed"
  - alert: Elasticsearch_Too_Few_Nodes_Running
    expr: elasticsearch_cluster_health_number_of_nodes < 10
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "There are only {{ $value }} < 10 ElasticSearch nodes running"
      description: "ElasticSearch is running on fewer than 10 nodes (10 expected)"
  - alert: Elasticsearch_Count_of_JVM_GC_Runs
    expr: rate(elasticsearch_jvm_gc_collection_seconds_count[5m]) > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "ElasticSearch node {{ $labels.instance }}: Count of JVM GC runs > 5 per sec and has a value of {{ $value }}"
      description: "ElasticSearch node {{ $labels.instance }}: Count of JVM GC runs > 5 per sec and has a value of {{ $value }}"
  - alert: Elasticsearch_GC_Run_Time
    expr: rate(elasticsearch_jvm_gc_collection_seconds_sum[5m]) > 0.3
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "ElasticSearch node {{ $labels.instance }}: GC run time in seconds > 0.3 sec and has a value of {{ $value }}"
      description: "ElasticSearch node {{ $labels.instance }}: GC run time in seconds > 0.3 sec and has a value of {{ $value }}"
  # - alert: Elasticsearch_json_parse_failures
  #   expr: elasticsearch_cluster_health_json_parse_failures > 0
  #   for: 1m
  #   labels:
  #     severity: critical
  #   annotations:
  #     summary: "ElasticSearch node {{ $labels.instance }}: json parse failures > 0 and has a value of {{ $value }}"
  #     description: "ElasticSearch node {{ $labels.instance }}: json parse failures > 0 and has a value of {{ $value }}"
  - alert: Elasticsearch_breakers_tripped
    expr: rate(elasticsearch_breakers_tripped[5m]) > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "ElasticSearch node {{ $labels.instance }}: breakers tripped > 0 and has a value of {{ $value }}"
      description: "ElasticSearch node {{ $labels.instance }}: breakers tripped > 0 and has a value of {{ $value }}"
  - alert: Elasticsearch_health_timed_out
    expr: elasticsearch_cluster_health_timed_out > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "ElasticSearch node {{ $labels.instance }}: Number of cluster health checks timed out > 0 and has a value of {{ $value }}"
      description: "ElasticSearch node {{ $labels.instance }}: Number of cluster health checks timed out > 0 and has a value of {{ $value }}"
3. Node alert rule templates (they can also be merged into one file)
1) CPU template
cat cpu.yml
groups:
- name: cpu.rules
  rules:
  # Alert when the CPU usage of any core stays above the threshold for 1 minute.
  # NOTE: a threshold of 1 (%) fires almost constantly; raise it for real use.
  - alert: NodeCpuUsage
    expr: 100 - irate(node_cpu_seconds_total{job="node",mode="idle"}[5m]) * 100 > 1
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "{{ $labels.instance }} cpu usage load too high"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been too high for more than 1 minute."
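The CPU expression above is evaluated per core, so one busy core on a many-core host is enough to fire it. A common alternative is to average the idle fraction over all cores of an instance first. The following is only a sketch: the alert name `NodeCpuUsageAverage`, the 80% threshold and the 5m window are assumptions for illustration, not values from this article.

```yaml
groups:
- name: cpu.rules
  rules:
  - alert: NodeCpuUsageAverage   # hypothetical name
    # Average the idle fraction across all cores of an instance, then invert it
    # to get the overall CPU usage in percent. 80 is an assumed threshold.
    expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) * 100 > 80
    for: 5m
    labels:
      severity: error
    annotations:
      summary: "{{ $labels.instance }} average cpu usage too high"
      description: "{{ $labels.instance }} average cpu usage has been above 80% for more than 5 minutes (current value is: {{ $value }})."
```

Because `avg by (instance)` drops every label except `instance`, the annotations here only reference `$labels.instance`.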
2) File system template
cat file_sys.yml
groups:
- name: file_sys.rules
  rules:
  - alert: NodeFilesystemUsage
    # node_exporter 0.16 and later expose these metrics with the _bytes suffix,
    # consistent with node_cpu_seconds_total and node_memory_*_bytes used in the other rules.
    expr: (node_filesystem_size_bytes{device="rootfs"} - node_filesystem_free_bytes{device="rootfs"}) / node_filesystem_size_bytes{device="rootfs"} * 100 > 80
    for: 2m
    labels:
      severity: error
    annotations:
      summary: "{{ $labels.instance }}: High Filesystem usage detected"
      description: "{{ $labels.instance }}: Filesystem usage is above 80% (current value is: {{ $value }})"
3) Memory template
cat memory.yml
groups:
- name: mem.rules
  rules:
  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} memory usage is too high"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has had memory usage above 90% for more than 5 minutes."
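The memory rule above derives "used" memory from MemFree, Buffers and Cached. On kernels that report MemAvailable, node_exporter also exposes `node_memory_MemAvailable_bytes`, which is the kernel's own estimate of reclaimable memory. A possible variant is sketched below; the alert name is hypothetical and whether MemAvailable exists on your hosts is an assumption to verify.

```yaml
groups:
- name: mem.rules
  rules:
  - alert: NodeMemoryUsageAvailable   # hypothetical name
    # Used percentage = 100 * (1 - available / total).
    expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
    for: 5m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} memory usage is too high"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has had memory usage above 90% for more than 5 minutes (current value is: {{ $value }})."
```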
4) Node liveness template
cat node_up.yml
groups:
- name: general.rules
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
5) After merging
groups:
- name: Host_Monitor
  rules:
  # Alert when the CPU usage of any core stays above the threshold for 1 minute.
  # NOTE: a threshold of 1 (%) fires almost constantly; raise it for real use.
  - alert: NodeCpuUsage
    expr: 100 - irate(node_cpu_seconds_total{job="node",mode="idle"}[5m]) * 100 > 1
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "{{ $labels.instance }} cpu usage load too high"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been too high for more than 1 minute."
  - alert: NodeFilesystemUsage
    # node_exporter 0.16 and later expose these metrics with the _bytes suffix.
    expr: (node_filesystem_size_bytes{device="rootfs"} - node_filesystem_free_bytes{device="rootfs"}) / node_filesystem_size_bytes{device="rootfs"} * 100 > 80
    for: 2m
    labels:
      severity: error
    annotations:
      summary: "{{ $labels.instance }}: High Filesystem usage detected"
      description: "{{ $labels.instance }}: Filesystem usage is above 80% (current value is: {{ $value }})"
  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} memory usage is too high"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has had memory usage above 90% for more than 5 minutes."
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
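The rule files only take effect once the main Prometheus configuration references them. A minimal sketch of the relevant fragment follows; the directory matches the `rules` directory created at the start of this section, while the glob pattern is an assumption about how you want to load the files.

```yaml
# Fragment of prometheus.yml.
# Globs are allowed here; every matching file is loaded as a rule file.
rule_files:
  - "/home/monitor/prometheus/rules/*.yml"
```

If the per-topic node files (cpu.yml, file_sys.yml, memory.yml, node_up.yml) and the merged file all sit in the same directory, this glob loads each node rule twice, so keep only one of the two variants. Running `promtool check rules` against a file before reloading Prometheus is a cheap way to catch indentation mistakes.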
Dashboard configuration
1. Download and install the grafana package
wget https://dl.grafana.com/oss/release/grafana-6.4.3-1.x86_64.rpm
yum localinstall -y grafana-6.4.3-1.x86_64.rpm
systemctl start grafana-server
systemctl enable grafana-server
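2. Log in to grafana and connect to the database (the Prometheus data source)
Log in to grafana
Default port: 3000
Address: http://ip:3000
Initial administrator account: admin, password: admin
Connect to the database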
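The data source is normally added by clicking through the grafana UI. grafana 5.0 and later can also read data source definitions from provisioning files at startup, which makes the step repeatable. The sketch below assumes the rpm's default provisioning directory (/etc/grafana/provisioning/datasources/) and assumes Prometheus listens on localhost:9090; neither is stated in this article, and the file name is arbitrary.

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yaml (assumed location and file name)
apiVersion: 1
datasources:
  - name: Prometheus              # display name inside grafana
    type: prometheus
    access: proxy                 # grafana's backend queries Prometheus on behalf of the browser
    url: http://localhost:9090    # assumed Prometheus address
    isDefault: true
```

After placing the file, restart grafana-server so that the definition is picked up.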
3. Dashboard configuration
1) Node dashboard
Links
https://grafana.com/grafana/dashboards?search=node%20export
https://grafana.com/grafana/dashboards/8919
Search for a node template.
Copy the template ID.
Paste the ID into grafana and import the template.
Final display.
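Importing by ID in the UI is the simplest route. If you would rather keep the downloaded dashboard JSON files on disk, grafana can also load them through a dashboard provider definition. This is a sketch under assumptions: the provider name and the directory /var/lib/grafana/dashboards are placeholders chosen for illustration, not paths from this article.

```yaml
# /etc/grafana/provisioning/dashboards/default.yaml (assumed location and file name)
apiVersion: 1
providers:
  - name: default                 # hypothetical provider name
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards   # assumed directory holding the dashboard JSON files
```

Dashboards downloaded from grafana.com usually reference the data source through an input placeholder that the UI import dialog fills in. With file-based loading that placeholder may have to be replaced by hand with the name of your data source.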
2) mysql dashboard
https://grafana.com/grafana/dashboards?search=Mysql%20over
https://grafana.com/dashboards/7362
3) es dashboard
https://grafana.com/grafana/dashboards/6483 (recommended)
https://grafana.com/grafana/dashboards/2322 (the variable query expression needs to be adjusted)
label_values(elasticsearch_indices_docs{instance="$instance",cluster="$cluster", name!=""},name)
4) Nginx dashboard
https://grafana.com/grafana/dashboards/2949
https://grafana.com/grafana/dashboards/2984
5) Tomcat dashboard
https://grafana.com/grafana/dashboards/8563 (provides the jmx-exporter configuration)
When importing the template, you need to enter the job name. The job name can be found in the Tomcat JSON file (/home/monitor/prometheus/conf.d/tomcat_node.json). It can also be viewed in the Prometheus web UI.
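For context, a JSON file like this is typically wired into Prometheus through file-based service discovery. The fragment below is only a sketch of how that might look: the `job_name` value is hypothetical, and the actual scrape configuration from the earlier parts of this series may differ.

```yaml
# Fragment of prometheus.yml (sketch; the job_name value is hypothetical)
scrape_configs:
  - job_name: tomcat
    file_sd_configs:
      - files:
          - /home/monitor/prometheus/conf.d/tomcat_node.json
```

By default every target discovered this way gets `job="tomcat"` from `job_name`. If a target group in the JSON file sets its own `job` label, that value takes precedence, and that is the name to enter when importing the dashboard.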
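Creating users in grafana
1. Create a user by email invitation
Create the user. grafana warns that no mail server is configured; this can simply be ignored. With the email method the email address merely becomes the user name, and no mail is actually sent to your mailbox.
Press F12 in the browser to see the real backend link address of the invite. The link turns out to point to localhost. There is obviously no grafana service on the local computer, so this address cannot be opened and the password cannot be set.
2. Create a user without email
3. Work around the unreachable localhost invite address by setting up an nginx reverse proxy
With both methods above, the final invite link cannot be opened. The reason is that grafana is deployed on a server rather than on the local computer. The workaround is a reverse proxy: nginx on the local computer forwards the request to the destination address we actually want to reach.
1) Download nginx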
http://nginx.org/download/nginx-1.16.1.zip
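2) Unzip and configure
3) Start nginx
Double-click nginx.exe and check the listening port
4) Log in and open the link (paste the invite link address)
If the account was created without email, log in with the user name.
If the account was created by email, log in with the email address as the user name.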