Prometheus Monitoring alert rules include Application Real-Time Monitoring Service
(ARMS) alert rules, Kubernetes alert rules, MongoDB alert rules, MySQL alert rules,
NGINX alert rules, and Redis alert rules.
ARMS alert rules
PodCpu75
100 * (sum(rate(container_cpu_usage_seconds_total[1m])) by (pod_name) / sum(label_replace(kube_pod_container_resource_limits_cpu_cores,
"pod_name", "$1", "pod", "(.*)")) by (pod_name))>75
The CPU utilization of a pod is greater than 75%.
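Expressions like the one above are deployed as Prometheus alerting rules. The following is a minimal sketch of a rule file wrapping the PodCpu75 expression; the file name, `for` duration, and `severity` label are illustrative assumptions, not part of the built-in rule set:

```yaml
# pod-cpu-alerts.yaml -- hypothetical file name; wraps the PodCpu75
# expression above in a standard Prometheus rule group.
groups:
  - name: arms-pod-alerts
    rules:
      - alert: PodCpu75
        expr: |
          100 * (sum(rate(container_cpu_usage_seconds_total[1m])) by (pod_name)
            / sum(label_replace(kube_pod_container_resource_limits_cpu_cores,
              "pod_name", "$1", "pod", "(.*)")) by (pod_name)) > 75
        for: 5m              # assumed pending duration
        labels:
          severity: warning  # assumed label
        annotations:
          summary: "CPU utilization of pod {{ $labels.pod_name }} exceeds 75%"
```

The file can be validated with `promtool check rules pod-cpu-alerts.yaml` before it is referenced from the `rule_files` section of the Prometheus server configuration.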
PodMemory75
100 * (sum(container_memory_working_set_bytes) by (pod_name) / sum(label_replace(kube_pod_container_resource_limits_memory_bytes,
"pod_name", "$1", "pod", "(.*)")) by (pod_name))>75
The memory usage of a pod is greater than 75%.
pod_status_no_running
sum (kube_pod_status_phase{phase!="Running"}) by (pod,phase)
A pod is not running.
PodMem4GbRestart
(sum (container_memory_working_set_bytes{id!="/"})by (pod_name,container_name) /1024/1024/1024)>4
The memory of a pod is larger than 4 GB.
PodRestart
sum (increase (kube_pod_container_status_restarts_total{}[2m])) by (namespace,pod)
A pod is restarted.
Kubernetes alert rules
KubeStateMetricsListErrors
(sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m]))
/ sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m]))) > 0.01
Errors occur in kube-state-metrics LIST operations.
KubeStateMetricsWatchErrors
(sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m]))
/ sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m]))) > 0.01
Errors occur in kube-state-metrics WATCH operations.
NodeFilesystemAlmostOutOfSpace
( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""}
* 100 < 5 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 )
A node file system is running out of space.
NodeFilesystemSpaceFillingUp
( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""}
* 100 < 40 and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h],
24*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 )
A node file system is predicted to run out of space within 24 hours.
NodeFilesystemFilesFillingUp
( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""}
* 100 < 40 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h],
24*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 )
A node file system is predicted to run out of inodes within 24 hours.
NodeFilesystemAlmostOutOfFiles
( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""}
* 100 < 3 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 )
A node file system has almost no free inodes.
NodeNetworkReceiveErrs
increase(node_network_receive_errs_total[2m]) > 10
Network receive errors occur on a node.
NodeNetworkTransmitErrs
increase(node_network_transmit_errs_total[2m]) > 10
Network transmit errors occur on a node.
NodeHighNumberConntrackEntriesUsed
(node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75
A large number of conntrack entries are used.
NodeClockSkewDetected
( node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0 )
or ( node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0 )
Time deviation occurs.
NodeClockNotSynchronising
min_over_time(node_timex_sync_status[5m]) == 0
The node clock is not synchronizing.
KubePodCrashLooping
rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) * 60
* 5 > 0
A pod is crash-looping.
KubePodNotReady
sum by (namespace, pod) (max by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",
phase=~"Pending|Unknown"}) * on(namespace, pod) group_left(owner_kind) max by(namespace,
pod, owner_kind) (kube_pod_owner{owner_kind!="Job"})) > 0
A pod is not ready.
KubeDeploymentGenerationMismatch
kube_deployment_status_observed_generation{job="kube-state-metrics"} != kube_deployment_metadata_generation{job="kube-state-metrics"}
Deployment versions do not match.
KubeDeploymentReplicasMismatch
( kube_deployment_spec_replicas{job="kube-state-metrics"} != kube_deployment_status_replicas_available{job="kube-state-metrics"}
) and ( changes(kube_deployment_status_replicas_updated{job="kube-state-metrics"}[5m])
== 0 )
Deployment replicas do not match.
KubeStatefulSetReplicasMismatch
( kube_statefulset_status_replicas_ready{job="kube-state-metrics"} != kube_statefulset_status_replicas{job="kube-state-metrics"}
) and ( changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics"}[5m])
== 0 )
StatefulSet replicas do not match.
KubeStatefulSetGenerationMismatch
kube_statefulset_status_observed_generation{job="kube-state-metrics"} != kube_statefulset_metadata_generation{job="kube-state-metrics"}
StatefulSet versions do not match.
KubeStatefulSetUpdateNotRolledOut
max without (revision) ( kube_statefulset_status_current_revision{job="kube-state-metrics"}
unless kube_statefulset_status_update_revision{job="kube-state-metrics"} ) * ( kube_statefulset_replicas{job="kube-state-metrics"}
!= kube_statefulset_status_replicas_updated{job="kube-state-metrics"} )
A StatefulSet update is not rolled out.
KubeDaemonSetRolloutStuck
kube_daemonset_status_number_ready{job="kube-state-metrics"} / kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
A DaemonSet rollout is stuck.
KubeContainerWaiting
sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics"})
A container is waiting.
KubeDaemonSetNotScheduled
kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} - kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"}
A DaemonSet is not scheduled.
KubeDaemonSetMisScheduled
kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0
A DaemonSet is misscheduled.
KubeCronJobRunning
time() - kube_cronjob_next_schedule_time{job="kube-state-metrics"} > 3600
A cron job takes more than 1 hour to complete.
KubeJobCompletion
kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"}
A job has not completed.
KubeJobFailed
kube_job_failed{job="kube-state-metrics"} > 0
A job failed.
KubeHpaReplicasMismatch
(kube_hpa_status_desired_replicas{job="kube-state-metrics"} != kube_hpa_status_current_replicas{job="kube-state-metrics"})
and changes(kube_hpa_status_current_replicas[15m]) == 0
Horizontal Pod Autoscaler (HPA) replicas do not match.
KubeHpaMaxedOut
kube_hpa_status_current_replicas{job="kube-state-metrics"} == kube_hpa_spec_max_replicas{job="kube-state-metrics"}
The maximum number of HPA replicas is reached.
KubeCPUOvercommit
sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum{}) / sum(kube_node_status_allocatable_cpu_cores)
> (count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores)
The CPU is overcommitted.
KubeMemoryOvercommit
sum(namespace:kube_pod_container_resource_requests_memory_bytes:sum{}) / sum(kube_node_status_allocatable_memory_bytes)
> (count(kube_node_status_allocatable_memory_bytes)-1) / count(kube_node_status_allocatable_memory_bytes)
The memory is overcommitted.
KubeCPUQuotaOvercommit
sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="cpu"}) / sum(kube_node_status_allocatable_cpu_cores)
The CPU quota is overcommitted.
KubeMemoryQuotaOvercommit
sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="memory"})
/ sum(kube_node_status_allocatable_memory_bytes{job="node-exporter"}) > 1.5
The memory quota is overcommitted.
KubeQuotaExceeded
kube_resourcequota{job="kube-state-metrics", type="used"} / ignoring(instance, job,
type) (kube_resourcequota{job="kube-state-metrics", type="hard"} > 0) > 0.90
The quota is exceeded.
CPUThrottlingHigh
sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container,
pod, namespace) / sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container,
pod, namespace) > ( 25 / 100 )
CPU throttling is high.
KubePersistentVolumeFillingUp
kubelet_volume_stats_available_bytes{job="kubelet", metrics_path="/metrics"} / kubelet_volume_stats_capacity_bytes{job="kubelet",
metrics_path="/metrics"} < 0.03
The volume capacity is insufficient.
KubePersistentVolumeErrors
kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"}
A persistent volume is in the Failed or Pending state.
KubeVersionMismatch
count(count by (gitVersion) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"},"gitVersion","$1","gitVersion","(v[0-9]*.[0-9]*.[0-9]*).*"))) > 1
Versions do not match.
KubeClientErrors
(sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job) / sum(rate(rest_client_requests_total[5m]))
by (instance, job)) > 0.01
Client errors occur.
KubeAPIErrorBudgetBurn
sum(apiserver_request:burnrate1h) > (14.40 * 0.01000) and sum(apiserver_request:burnrate5m)
> (14.40 * 0.01000)
Excessive API errors occur.
KubeAPILatencyHigh
( cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"} > on (verb) group_left()
( avg by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"}
>= 0) + 2*stddev by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"}
>= 0) ) ) > on (verb) group_left() 1.2 * avg by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"}
>= 0) and on (verb,resource) cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver",quantile="0.99"}
The API latency is high.
KubeAPIErrorsHigh
sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) by (resource,subresource,verb)
/ sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (resource,subresource,verb)
Excessive API errors occur.
KubeClientCertificateExpiration
apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job)
histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m])))
< 604800
The client certificate expires.
AggregatedAPIErrors
sum by(name, namespace)(increase(aggregator_unavailable_apiservice_count[5m])) > 2
Errors occur in an aggregated API.
AggregatedAPIDown
sum by(name, namespace)(sum_over_time(aggregator_unavailable_apiservice[5m])) > 0
The aggregated API is offline.
KubeAPIDown
absent(up{job="apiserver"} == 1)
The API server is offline.
KubeNodeNotReady
kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"}
A node is not ready.
KubeNodeUnreachable
kube_node_spec_taint{job="kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"}
A node is unreachable.
KubeletTooManyPods
max(max(kubelet_running_pod_count{job="kubelet", metrics_path="/metrics"}) by(instance)
* on(instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"})
by(node) / max(kube_node_status_capacity_pods{job="kube-state-metrics"} != 1) by(node)
Excessive pods exist.
KubeNodeReadinessFlapping
sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by
(node) > 2
The readiness status changes frequently.
KubeletPlegDurationHigh
node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile{quantile="0.99"}
The Pod Lifecycle Event Generator (PLEG) relist duration is high.
KubeletPodStartUpLatencyHigh
histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet",
metrics_path="/metrics"}[5m])) by (instance, le)) * on(instance) group_left(node)
kubelet_node_name{job="kubelet", metrics_path="/metrics"} > 60
The startup latency of a pod is high.
KubeletDown
absent(up{job="kubelet", metrics_path="/metrics"} == 1)
The kubelet is offline.
KubeSchedulerDown
absent(up{job="kube-scheduler"} == 1)
The Kubernetes scheduler is offline.
KubeControllerManagerDown
absent(up{job="kube-controller-manager"} == 1)
The controller manager is offline.
TargetDown
100 * (count(up == 0) BY (job, namespace, service) / count(up) BY (job, namespace,
service)) > 10
The target is offline.
NodeNetworkInterfaceFlapping
changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2
The network interface status changes frequently.
MongoDB alert rules
MongodbReplicationLag
avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"})
The replication latency is high.
MongodbReplicationHeadroom
(avg(mongodb_replset_oplog_tail_timestamp - mongodb_replset_oplog_head_timestamp)
- (avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"})))
The replication margin is insufficient.
MongodbReplicationStatus3
mongodb_replset_member_state == 3
The replication status is 3 (RECOVERING).
MongodbReplicationStatus6
mongodb_replset_member_state == 6
The replication status is 6 (UNKNOWN).
MongodbReplicationStatus8
mongodb_replset_member_state == 8
The replication status is 8 (DOWN).
MongodbReplicationStatus10
mongodb_replset_member_state == 10
The replication status is 10 (REMOVED).
MongodbNumberCursorsOpen
mongodb_metrics_cursor_open{state="total_open"} > 10000
Excessive cursors exist.
MongodbCursorsTimeouts
sum(increase(mongodb_metrics_cursor_timed_out_total[10m])) > 100
The cursor times out.
MongodbTooManyConnections
mongodb_connections{state="current"} > 500
Excessive connections exist.
MongodbVirtualMemoryUsage
(sum(mongodb_memory{type="virtual"}) BY (ip) / sum(mongodb_memory{type="mapped"})
BY (ip)) > 3
The virtual memory usage is high.
MySQL alert rules
open files high
mysql_global_status_innodb_num_open_files > (mysql_global_variables_open_files_limit)
* 0.75
Excessive files are opened.
Read buffer size is bigger than max. allowed packet size
mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet
The size of the read buffer exceeds the maximum allowed packet size.
Sort buffer possibly missconfigured
mysql_global_variables_innodb_sort_buffer_size <256*1024 or mysql_global_variables_read_buffer_size
> 4*1024*1024
A configuration error may exist in the sort buffer.
Thread stack size is too small
mysql_global_variables_thread_stack <196608
The thread stack size is small.
Used more than 80% of max connections limited
mysql_global_status_max_used_connections > mysql_global_variables_max_connections
* 0.8
More than 80% of the maximum connections are in use.
InnoDB Force Recovery is enabled
mysql_global_variables_innodb_force_recovery != 0
Force recovery is enabled.
InnoDB Log File size is too small
mysql_global_variables_innodb_log_file_size < 16777216
The log file size is small.
InnoDB Flush Log at Transaction Commit
mysql_global_variables_innodb_flush_log_at_trx_commit != 1
Logs are not flushed at each transaction commit.
Table definition cache too small
mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache
The number of cached table definitions is small.
Table open cache too small
mysql_global_status_open_tables >mysql_global_variables_table_open_cache * 99/100
The number of cached open tables is small.
Thread stack size is possibly too small
mysql_global_variables_thread_stack < 262144
The thread stack size may be small.
InnoDB Buffer Pool Instances is too small
mysql_global_variables_innodb_buffer_pool_instances == 1
The number of instances in the buffer pool is small.
InnoDB Plugin is enabled
mysql_global_variables_ignore_builtin_innodb == 1
The built-in InnoDB is ignored (the InnoDB plugin is used).
Binary Log is disabled
mysql_global_variables_log_bin != 1
Binary logs are disabled.
Binlog Cache size too small
mysql_global_variables_binlog_cache_size < 1048576
The cache size is small.
Binlog Statement Cache size too small
mysql_global_variables_binlog_stmt_cache_size < 1048576 and mysql_global_variables_binlog_stmt_cache_size > 0
The statement cache size is small.
Binlog Transaction Cache size too small
mysql_global_variables_binlog_cache_size <1048576
The transaction cache size is small.
Sync Binlog is enabled
mysql_global_variables_sync_binlog == 1
Binary log synchronization is enabled (sync_binlog is set to 1).
IO thread stopped
mysql_slave_status_slave_io_running != 1
I/O threads are stopped.
SQL thread stopped
mysql_slave_status_slave_sql_running == 0
SQL threads are stopped.
Mysql_Too_Many_Connections
rate(mysql_global_status_threads_connected[5m])>200
Excessive connections exist.
Mysql_Too_Many_slow_queries
rate(mysql_global_status_slow_queries[5m])>3
Excessive slow queries exist.
Slave lagging behind Master
rate(mysql_slave_status_seconds_behind_master[1m]) >30
A secondary node lags behind the primary node.
Slave is NOT read only(Please ignore this warning indicator.)
mysql_global_variables_read_only != 0
A secondary node is not read-only.
NGINX alert rules
NginxHighHttp4xxErrorRate
sum(rate(nginx_http_requests_total{status=~"^4.."}[1m])) / sum(rate(nginx_http_requests_total[1m]))
* 100 > 5
The rate of HTTP 4xx errors is high.
NginxHighHttp5xxErrorRate
sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m]))
* 100 > 5
The rate of HTTP 5xx errors is high.
NginxLatencyHigh
histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[30m]))
by (host, node)) > 10
The latency is high.
Redis alert rules
RedisDisconnectedSlaves
count without (instance, job) (redis_connected_slaves) - sum without (instance, job)
(redis_connected_slaves) - 1 > 1
Secondary nodes are disconnected.
RedisReplicationBroken
delta(redis_connected_slaves[1m]) < 0
The replication is interrupted.
RedisClusterFlapping
changes(redis_connected_slaves[5m]) > 2
Changes are detected in the connection to replica nodes.
RedisMissingBackup
time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24
No RDB snapshot has been saved in the past 24 hours.
RedisOutOfMemory
redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90
The memory is insufficient.
RedisTooManyConnections
redis_connected_clients > 100
Excessive connections exist.
RedisNotEnoughConnections
redis_connected_clients < 5
Connections are insufficient.
RedisRejectedConnections
increase(redis_rejected_connections_total[1m]) > 0
The connection is rejected.
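Rule files that contain the groups above are referenced from the Prometheus server configuration. The following is a minimal sketch; the rule file names and the Alertmanager address are assumptions for illustration:

```yaml
# prometheus.yml -- fragment only; file names and the Alertmanager
# target below are hypothetical.
rule_files:
  - "kubernetes-alerts.yaml"
  - "mysql-alerts.yaml"
  - "redis-alerts.yaml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
```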