Prometheus alarm rule description

Prometheus monitoring alarm rules include ARMS alarm rules, K8s alarm rules, MongoDB alarm rules, MySQL alarm rules, Nginx alarm rules, and Redis alarm rules.

ARMS alarm rules

 
Alarm name expression Data collection time (minutes) Alarm trigger condition
PodCpu75 100 * (sum(rate(container_cpu_usage_seconds_total[1m])) by (pod_name) / sum(label_replace(kube_pod_container_resource_limits_cpu_cores, "pod_name", "$1", "pod", "(.*)")) by (pod_name))>75 7 The CPU usage of the Pod is greater than 75%.
PodMemory75 100 * (sum(container_memory_working_set_bytes) by (pod_name) / sum(label_replace(kube_pod_container_resource_limits_memory_bytes, "pod_name", "$1", "pod", "(.*)")) by (pod_name))>75 5 Pod memory usage is greater than 75%.
pod_status_no_running sum (kube_pod_status_phase{phase!="Running"}) by (pod,phase) 5 The Pod is not in the Running state.
PodMem4GbRestart (sum (container_memory_working_set_bytes{id!="/"})by (pod_name,container_name) /1024/1024/1024)>4 5 Pod memory usage is greater than 4 GB.
PodRestart sum (increase (kube_pod_container_status_restarts_total{}[2m])) by (namespace,pod) >0 5 The Pod has restarted.
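
Each row of these tables can be read as one alerting rule. The following Python sketch renders a row as an entry in the standard Prometheus rule-file layout (alert, expr, for). It assumes that the "Data collection time (minutes)" column corresponds to the rule's `for` duration; the source does not state this, and the function name `render_rule` is hypothetical.

```python
import json


def render_rule(name: str, expr: str, minutes: int, summary: str) -> str:
    """Render one table row as a Prometheus alerting-rule entry (YAML text)."""
    lines = [
        f"- alert: {name}",
        # json.dumps produces a double-quoted string, which is also a valid
        # YAML scalar, so quotes inside the PromQL expression stay escaped.
        f"  expr: {json.dumps(expr)}",
        # Assumption: the "data collection time" column is the `for` duration.
        f"  for: {minutes}m",
        "  annotations:",
        f"    summary: {json.dumps(summary)}",
    ]
    return "\n".join(lines)


rule = render_rule(
    "PodRestart",
    "sum (increase (kube_pod_container_status_restarts_total{}[2m])) by (namespace,pod) >0",
    5,
    "The Pod has restarted.",
)
print(rule)
```

The printed entry would sit under a `rules:` list inside a rule group; how ARMS stores the rule internally may differ.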

K8s alarm rules

 
Alarm name expression Data collection time (minutes) Alarm trigger condition
KubeStateMetricsListErrors (sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m]))) > 0.01 15 Metric List error.
KubeStateMetricsWatchErrors (sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m]))) > 0.01 15 Metric Watch error.
NodeFilesystemAlmostOutOfSpace ( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) 60 The Node file system is about to run out of space.
NodeFilesystemSpaceFillingUp ( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 40 and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) 60 Node file system space is about to be full.
NodeFilesystemFilesFillingUp ( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 40 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) 60 The files of the Node file system are about to be used up.
NodeFilesystemAlmostOutOfFiles ( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) 60 The Node file system has almost no free files left.
NodeNetworkReceiveErrs increase(node_network_receive_errs_total[2m]) > 10 60 Node network receive errors occur.
NodeNetworkTransmitErrs increase(node_network_transmit_errs_total[2m]) > 10 60 Node network transmit errors occur.
NodeHighNumberConntrackEntriesUsed (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75 A large number of Conntrack entries are in use.
NodeClockSkewDetected ( node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0 ) or ( node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0 ) 10 Clock skew is detected.
NodeClockNotSynchronising min_over_time(node_timex_sync_status[5m]) == 0 10 The clock is not synchronizing.
KubePodCrashLooping rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) * 60 * 5 > 0 15 A crash loop occurs.
KubePodNotReady sum by (namespace, pod) (max by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics", phase=~"Pending|Unknown"}) * on(namespace, pod) group_left(owner_kind) max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"})) > 0 15 The Pod is not ready.
KubeDeploymentGenerationMismatch kube_deployment_status_observed_generation{job="kube-state-metrics"} != kube_deployment_metadata_generation{job="kube-state-metrics"} 15 The Deployment generation does not match.
KubeDeploymentReplicasMismatch ( kube_deployment_spec_replicas{job="kube-state-metrics"} != kube_deployment_status_replicas_available{job="kube-state-metrics"} ) and ( changes(kube_deployment_status_replicas_updated{job="kube-state-metrics"}[5m]) == 0 ) 15 The Deployment replicas do not match.
KubeStatefulSetReplicasMismatch ( kube_statefulset_status_replicas_ready{job="kube-state-metrics"} != kube_statefulset_status_replicas{job="kube-state-metrics"} ) and ( changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics"}[5m]) == 0 ) 15 The StatefulSet replicas do not match.
KubeStatefulSetGenerationMismatch kube_statefulset_status_observed_generation{job="kube-state-metrics"} != kube_statefulset_metadata_generation{job="kube-state-metrics"} 15 The StatefulSet generation does not match.
KubeStatefulSetUpdateNotRolledOut max without (revision) ( kube_statefulset_status_current_revision{job="kube-state-metrics"} unless kube_statefulset_status_update_revision{job="kube-state-metrics"} ) * ( kube_statefulset_replicas{job="kube-state-metrics"} != kube_statefulset_status_replicas_updated{job="kube-state-metrics"} ) 15 The StatefulSet update has not been rolled out.
KubeDaemonSetRolloutStuck kube_daemonset_status_number_ready{job="kube-state-metrics"} / kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} < 1.00 15 The DaemonSet rollout is stuck.
KubeContainerWaiting sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics"}) > 0 60 The container is waiting.
KubeDaemonSetNotScheduled kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} - kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"} > 0 10 The DaemonSet is not scheduled.
KubeDaemonSetMisScheduled kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0 15 The DaemonSet is mis-scheduled.
KubeCronJobRunning time() - kube_cronjob_next_schedule_time{job="kube-state-metrics"} > 3600 60 The CronJob takes more than 1 hour to complete.
KubeJobCompletion kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} > 0 60 The Job has not completed.
KubeJobFailed kube_job_failed{job="kube-state-metrics"} > 0 15 The Job failed.
KubeHpaReplicasMismatch (kube_hpa_status_desired_replicas{job="kube-state-metrics"} != kube_hpa_status_current_replicas{job="kube-state-metrics"}) and changes(kube_hpa_status_current_replicas[15m]) == 0 15 The HPA replicas do not match.
KubeHpaMaxedOut kube_hpa_status_current_replicas{job="kube-state-metrics"} == kube_hpa_spec_max_replicas{job="kube-state-metrics"} 15 The HPA is running at its maximum number of replicas.
KubeCPUOvercommit sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum{}) / sum(kube_node_status_allocatable_cpu_cores) > (count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores) 5 CPU is overcommitted.
KubeMemoryOvercommit sum(namespace:kube_pod_container_resource_requests_memory_bytes:sum{}) / sum(kube_node_status_allocatable_memory_bytes) > (count(kube_node_status_allocatable_memory_bytes)-1) / count(kube_node_status_allocatable_memory_bytes) 5 Memory is overcommitted.
KubeCPUQuotaOvercommit sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="cpu"}) / sum(kube_node_status_allocatable_cpu_cores) > 1.5 5 The CPU quota is overcommitted.
KubeMemoryQuotaOvercommit sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="memory"}) / sum(kube_node_status_allocatable_memory_bytes{job="node-exporter"}) > 1.5 5 The memory quota is overcommitted.
KubeQuotaExceeded kube_resourcequota{job="kube-state-metrics", type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics", type="hard"} > 0) > 0.90 15 The quota usage exceeds the limit.
CPUThrottlingHigh sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace) > ( 25 / 100 ) 15 CPU throttling is high.
KubePersistentVolumeFillingUp kubelet_volume_stats_available_bytes{job="kubelet", metrics_path="/metrics"} / kubelet_volume_stats_capacity_bytes{job="kubelet", metrics_path="/metrics"} < 0.03 1 The persistent volume is about to run out of capacity.
KubePersistentVolumeErrors kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0 5 The persistent volume has an error.
KubeVersionMismatch count(count by (gitVersion) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"},"gitVersion","$1","gitVersion","(v[0-9]*.[0-9]*.[0-9]*).*"))) > 1 15 The versions do not match.
KubeClientErrors (sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job) / sum(rate(rest_client_requests_total[5m])) by (instance, job)) > 0.01 15 Client errors occur.
KubeAPIErrorBudgetBurn sum(apiserver_request:burnrate1h) > (14.40 * 0.01000) and sum(apiserver_request:burnrate5m) > (14.40 * 0.01000) 2 Too many API errors occur.
KubeAPILatencyHigh ( cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"} > on (verb) group_left() ( avg by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"} >= 0) + 2*stddev by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"} >= 0) ) ) > on (verb) group_left() 1.2 * avg by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"} >= 0) and on (verb,resource) cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver",quantile="0.99"} > 1 5 The API latency is too high.
KubeAPIErrorsHigh sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) by (resource,subresource,verb) / sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (resource,subresource,verb) > 0.05 10 Too many API errors occur.
KubeClientCertificateExpiration apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800 The client certificate is expiring.
AggregatedAPIErrors sum by(name, namespace)(increase(aggregator_unavailable_apiservice_count[5m])) > 2 Aggregated API errors occur.
AggregatedAPIDown sum by(name, namespace)(sum_over_time(aggregator_unavailable_apiservice[5m])) > 0 5 The aggregated API is down.
KubeAPIDown absent(up{job="apiserver"} == 1) 15 The API is down.
KubeNodeNotReady kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0 15 The Node is not ready.
KubeNodeUnreachable kube_node_spec_taint{job="kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1 2 The Node is unreachable.
KubeletTooManyPods max(max(kubelet_running_pod_count{job="kubelet", metrics_path="/metrics"}) by(instance) * on(instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"}) by(node) / max(kube_node_status_capacity_pods{job="kube-state-metrics"} != 1) by(node) > 0.95 15 Too many Pods are running.
KubeNodeReadinessFlapping sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (node) > 2 15 The readiness status changes too many times.
KubeletPlegDurationHigh node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile{quantile="0.99"} >= 10 5 The PLEG duration is too long.
KubeletPodStartUpLatencyHigh histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet", metrics_path="/metrics"}[5m])) by (instance, le)) * on(instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"} > 60 15 The Pod startup latency is too high.
KubeletDown absent(up{job="kubelet", metrics_path="/metrics"} == 1) 15 The kubelet is down.
KubeSchedulerDown absent(up{job="kube-scheduler"} == 1) 15 The kube-scheduler is down.
KubeControllerManagerDown absent(up{job="kube-controller-manager"} == 1) 15 The Controller Manager is down.
TargetDown 100 * (count(up == 0) BY (job, namespace, service) / count(up) BY (job, namespace, service)) > 10 10 The target is down.
NodeNetworkInterfaceFlapping changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2 2 The network interface status changes too frequently.
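
The "filling up" rules above combine a static threshold with `predict_linear(...[6h], 24*60*60) < 0`, that is, a straight-line forecast 24 hours ahead based on the last 6 hours of samples. The Python sketch below illustrates the idea with an ordinary least-squares fit. It is a simplification (it extrapolates from the last sample rather than from the evaluation timestamp), and the sample data is hypothetical.

```python
GIB = 1024 ** 3


def predict_linear(samples, horizon_seconds):
    """Fit a least-squares line to (timestamp, value) samples and
    extrapolate it horizon_seconds past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    # Value of the fitted line at the last sample's timestamp.
    value_at_last = mean_v + slope * (samples[-1][0] - mean_t)
    return value_at_last + slope * horizon_seconds


# Hypothetical data: free space drops 1 GiB per hour from 20 GiB,
# sampled hourly over a 6-hour window.
samples = [(hour * 3600, (20 - hour) * GIB) for hour in range(7)]
predicted = predict_linear(samples, 24 * 60 * 60)
print(predicted / GIB)
```

With this data the forecast is negative, so the `predict_linear(...) < 0` condition would hold. The rule still fires only if the other conditions in the expression (usage threshold, file system not read-only) are also true.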

MongoDB alarm rules

Alarm name expression Data collection time (minutes) Alarm trigger condition
MongodbReplicationLag avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}) > 10 5 The replication lag is too long.
MongodbReplicationHeadroom (avg(mongodb_replset_oplog_tail_timestamp - mongodb_replset_oplog_head_timestamp) - (avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}))) <= 0 5 The replication headroom is insufficient.
MongodbReplicationStatus3 mongodb_replset_member_state == 3 5 The replication state is 3.
MongodbReplicationStatus6 mongodb_replset_member_state == 6 5 The replication state is 6.
MongodbReplicationStatus8 mongodb_replset_member_state == 8 5 The replication state is 8.
MongodbReplicationStatus10 mongodb_replset_member_state == 10 5 The replication state is 10.
MongodbNumberCursorsOpen mongodb_metrics_cursor_open{state="total_open"} > 10000 5 Too many cursors are open.
MongodbCursorsTimeouts sum(increase(mongodb_metrics_cursor_timed_out_total[10m])) > 100 5 Too many cursors time out.
MongodbTooManyConnections mongodb_connections{state="current"} > 500 5 There are too many connections.
MongodbVirtualMemoryUsage (sum(mongodb_memory{type="virtual"}) BY (ip) / sum(mongodb_memory{type="mapped"}) BY (ip)) > 3 5 The virtual memory usage is too high.
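
The four MongodbReplicationStatus rules compare `mongodb_replset_member_state` with a numeric code. As a reading aid, the Python sketch below maps the codes to MongoDB's replica set member state names. The helper name `describe_state` is hypothetical, and the table reflects the documented MongoDB state codes rather than anything stated in this article.

```python
# MongoDB replica set member states, keyed by numeric code.
REPLSET_MEMBER_STATES = {
    0: "STARTUP",
    1: "PRIMARY",
    2: "SECONDARY",
    3: "RECOVERING",
    5: "STARTUP2",
    6: "UNKNOWN",
    7: "ARBITER",
    8: "DOWN",
    9: "ROLLBACK",
    10: "REMOVED",
}

# Codes that the alarm rules in the table above match on.
ALERTING_STATES = (3, 6, 8, 10)


def describe_state(code: int) -> str:
    """Return the state name for a code, or a placeholder if unknown."""
    return REPLSET_MEMBER_STATES.get(code, f"unrecognized state {code}")


for code in ALERTING_STATES:
    print(code, describe_state(code))
```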

MySQL alarm rules

Alarm name expression Data collection time (minutes) Alarm trigger condition
MySQL is down mysql_up == 0 1 MySQL is down.
open files high mysql_global_status_innodb_num_open_files > (mysql_global_variables_open_files_limit) * 0.75 1 The number of open files is high.
Read buffer size is bigger than max. allowed packet size mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet 1 The read buffer size exceeds the maximum allowed packet size.
Sort buffer possibly missconfigured mysql_global_variables_innodb_sort_buffer_size <256*1024 or mysql_global_variables_read_buffer_size > 4*1024*1024 1 The sort buffer may be misconfigured.
Thread stack size is too small mysql_global_variables_thread_stack <196608 1 The thread stack size is too small.
Used more than 80% of max connections limited mysql_global_status_max_used_connections > mysql_global_variables_max_connections * 0.8 1 More than 80% of the connection limit is in use.
InnoDB Force Recovery is enabled mysql_global_variables_innodb_force_recovery != 0 1 Force recovery is enabled.
InnoDB Log File size is too small mysql_global_variables_innodb_log_file_size < 16777216 1 The log file is too small.
InnoDB Flush Log at Transaction Commit mysql_global_variables_innodb_flush_log_at_trx_commit != 1 1 The flush-log-at-transaction-commit setting is not 1.
Table definition cache too small mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache 1 The table definition cache is too small.
Table open cache too small mysql_global_status_open_tables >mysql_global_variables_table_open_cache * 99/100 1 The table open cache is too small.
Thread stack size is possibly too small mysql_global_variables_thread_stack < 262144 1 The thread stack size may be too small.
InnoDB Buffer Pool Instances is too small mysql_global_variables_innodb_buffer_pool_instances == 1 1 The number of buffer pool instances is too small.
InnoDB Plugin is enabled mysql_global_variables_ignore_builtin_innodb == 1 1 The InnoDB plugin is enabled.
Binary Log is disabled mysql_global_variables_log_bin != 1 1 The binary log is disabled.
Binlog Cache size too small mysql_global_variables_binlog_cache_size < 1048576 1 The binlog cache is too small.
Binlog Statement Cache size too small mysql_global_variables_binlog_stmt_cache_size <1048576 and mysql_global_variables_binlog_stmt_cache_size > 0 1 The statement cache is too small.
Binlog Transaction Cache size too small mysql_global_variables_binlog_cache_size <1048576 1 The transaction cache is too small.
Sync Binlog is enabled mysql_global_variables_sync_binlog == 1 1 Sync binlog is enabled.
IO thread stopped mysql_slave_status_slave_io_running != 1 1 The IO thread has stopped.
SQL thread stopped mysql_slave_status_slave_sql_running == 0 1 The SQL thread has stopped.
Mysql_Too_Many_Connections rate(mysql_global_status_threads_connected[5m])>200 5 There are too many connections.
Mysql_Too_Many_slow_queries rate(mysql_global_status_slow_queries[5m])>3 5 There are too many slow queries.
Slave lagging behind Master rate(mysql_slave_status_seconds_behind_master[1m]) >30 1 The slave lags behind the master.
Slave is NOT read only(Please ignore this warning indicator.) mysql_global_variables_read_only != 0 1 The slave is not read-only.
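
Several MySQL thresholds are written as raw byte counts (196608, 262144, 16777216, 1048576). The short Python sketch below converts them to binary units so the limits are easier to read. The helper `human_bytes` is hypothetical, and the dictionary keys simply reuse the variable names from the expressions above.

```python
def human_bytes(n: int) -> str:
    """Format a byte count using binary units (KiB, MiB, GiB)."""
    units = ["B", "KiB", "MiB", "GiB"]
    value = n
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit available.
        if value < 1024 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1024


# Thresholds copied from the MySQL expressions in the table above.
thresholds = {
    "thread_stack (too small)": 196608,
    "thread_stack (possibly too small)": 262144,
    "innodb_log_file_size": 16777216,
    "binlog_cache_size": 1048576,
    "innodb_sort_buffer_size": 256 * 1024,
    "read_buffer_size": 4 * 1024 * 1024,
}

for name, limit in thresholds.items():
    print(f"{name}: {human_bytes(limit)}")
```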

Nginx alarm rules

Alarm name expression Data collection time (minutes) Alarm trigger condition
NginxHighHttp4xxErrorRate sum(rate(nginx_http_requests_total{status=~"^4.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 5 The HTTP 4xx error rate is too high.
NginxHighHttp5xxErrorRate sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 5 The HTTP 5xx error rate is too high.
NginxLatencyHigh histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[30m])) by (host, node)) > 10 5 The latency is too high.

Redis alarm rules

Alarm name expression Data collection time (minutes) Alarm trigger condition
RedisDown redis_up == 0 5 Redis is down.
RedisMissingMaster count(redis_instance_info{role="master"}) == 0 5 The master is missing.
RedisTooManyMasters count(redis_instance_info{role="master"}) > 1 5 There are too many masters.
RedisDisconnectedSlaves count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 1 5 A slave is disconnected.
RedisReplicationBroken delta(redis_connected_slaves[1m]) < 0 5 Replication is broken.
RedisClusterFlapping changes(redis_connected_slaves[5m]) > 2 5 The replica connections change repeatedly.
RedisMissingBackup time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24 5 The backup is missing.
RedisOutOfMemory redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90 5 Memory is insufficient.
RedisTooManyConnections redis_connected_clients > 100 5 There are too many connections.
RedisNotEnoughConnections redis_connected_clients < 5 5 There are not enough connections.
RedisRejectedConnections increase(redis_rejected_connections_total[1m]) > 0 5 Connections are rejected.
