Alert thresholds depend on the nature of your applications. Some queries on this page use arbitrary tolerance thresholds. Building an efficient and battle-tested monitoring platform takes time. 😉
  - alert: PrometheusJobMissing
    expr: absent(up{job="prometheus"})
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus job missing (instance {{ $labels.instance }})
      description: "A Prometheus job has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusTargetMissing
    expr: up == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus target missing (instance {{ $labels.instance }})
      description: "A Prometheus target has disappeared. An exporter might have crashed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusAllTargetsMissing
    expr: sum by (job) (up) == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus all targets missing (instance {{ $labels.instance }})
      description: "A Prometheus job does not have any living targets anymore.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
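The snippets on this page are rule fragments: to load them, they need to sit under a group in a rule file that prometheus.yml references. A minimal wrapper sketch, where the file path and group name are placeholder assumptions:

# prometheus.yml (fragment) -- point Prometheus at your rule files.
rule_files:
  - "alerts/*.yml"

# alerts/prometheus-self-monitoring.yml (fragment) -- snippets from this page go under "rules:".
groups:
  - name: prometheus-self-monitoring
    rules:
      # paste alert blocks such as PrometheusJobMissing here, indented one level deeper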
  - alert: PrometheusTargetMissingWithWarmupTime
    expr: sum by (instance, job) ((up == 0) * on (instance) group_right(job) (node_time_seconds - node_boot_time_seconds > 600))
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus target missing with warmup time (instance {{ $labels.instance }})
      description: "Allow a job time to start up (10 minutes) before alerting that it's down.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusConfigurationReloadFailure
    expr: prometheus_config_last_reload_successful != 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus configuration reload failure (instance {{ $labels.instance }})
      description: "Prometheus configuration reload error\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusTooManyRestarts
    expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus too many restarts (instance {{ $labels.instance }})
      description: "Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusAlertmanagerJobMissing
    expr: absent(up{job="alertmanager"})
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus AlertManager job missing (instance {{ $labels.instance }})
      description: "A Prometheus AlertManager job has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusAlertmanagerConfigurationReloadFailure
    expr: alertmanager_config_last_reload_successful != 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }})
      description: "AlertManager configuration reload error\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusAlertmanagerConfigNotSynced
    expr: count(count_values("config_hash", alertmanager_config_hash)) > 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus AlertManager config not synced (instance {{ $labels.instance }})
      description: "Configurations of AlertManager cluster instances are out of sync\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusAlertmanagerE2eDeadManSwitch
    expr: vector(1)
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }})
      description: "Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
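A common way to consume this always-firing alert is a dedicated Alertmanager route that forwards it to an external heartbeat service, which pages you when the pings stop arriving. A minimal alertmanager.yml fragment to merge into an existing routing tree; the receiver name, URL and interval are placeholder assumptions:

# alertmanager.yml (fragment) -- forward the dead man's switch to a heartbeat endpoint.
route:
  routes:
    - receiver: deadmansswitch            # hypothetical receiver name
      matchers:
        - alertname = "PrometheusAlertmanagerE2eDeadManSwitch"
      repeat_interval: 1m                 # keep the heartbeat frequent
receivers:
  - name: deadmansswitch
    webhook_configs:
      - url: https://heartbeat.example.com/ping   # placeholder heartbeat URL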
  - alert: PrometheusNotConnectedToAlertmanager
    expr: prometheus_notifications_alertmanagers_discovered < 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus not connected to alertmanager (instance {{ $labels.instance }})
      description: "Prometheus cannot connect the alertmanager\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusRuleEvaluationFailures
    expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus rule evaluation failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusTemplateTextExpansionFailures
    expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus template text expansion failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} template text expansion failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusRuleEvaluationSlow
    expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Prometheus rule evaluation slow (instance {{ $labels.instance }})
      description: "Prometheus rule evaluation took more time than the scheduled interval. It indicates a slower storage backend access or too complex query.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusNotificationsBacklog
    expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus notifications backlog (instance {{ $labels.instance }})
      description: "The Prometheus notification queue has not been empty for 10 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusAlertmanagerNotificationFailing
    expr: rate(alertmanager_notifications_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus AlertManager notification failing (instance {{ $labels.instance }})
      description: "Alertmanager is failing sending notifications\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusTargetEmpty
    expr: prometheus_sd_discovered_targets == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus target empty (instance {{ $labels.instance }})
      description: "Prometheus has no target in service discovery\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusTargetScrapingSlow
    expr: prometheus_target_interval_length_seconds{quantile="0.9"} / on (interval, instance, job) prometheus_target_interval_length_seconds{quantile="0.5"} > 1.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Prometheus target scraping slow (instance {{ $labels.instance }})
      description: "Prometheus is scraping exporters slowly since it exceeded the requested interval time. Your Prometheus server is under-provisioned.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusLargeScrape
    expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Prometheus large scrape (instance {{ $labels.instance }})
      description: "Prometheus has many scrapes that exceed the sample limit\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusTargetScrapeDuplicate
    expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus target scrape duplicate (instance {{ $labels.instance }})
      description: "Prometheus has many samples rejected due to duplicate timestamps but different values\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusTsdbCheckpointCreationFailures
    expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} checkpoint creation failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusTsdbCheckpointDeletionFailures
    expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} checkpoint deletion failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusTsdbCompactionsFailed
    expr: increase(prometheus_tsdb_compactions_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB compactions failed (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB compactions failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusTsdbHeadTruncationsFailed
    expr: increase(prometheus_tsdb_head_truncations_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB head truncations failed (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB head truncation failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusTsdbReloadFailures
    expr: increase(prometheus_tsdb_reloads_failures_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB reload failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB reload failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusTsdbWalCorruptions
    expr: increase(prometheus_tsdb_wal_corruptions_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB WAL corruptions\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusTsdbWalTruncationsFailed
    expr: increase(prometheus_tsdb_wal_truncations_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB WAL truncation failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PrometheusTimeseriesCardinality
    expr: label_replace(count by(__name__) ({__name__=~".+"}), "name", "$1", "__name__", "(.+)") > 10000
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus timeseries cardinality (instance {{ $labels.instance }})
      description: "




    
The \"{{ $labels.name }}\" timeseries cardinality is getting very high: {{ $value }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostOutOfMemory
    expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of memory (instance {{ $labels.instance }})
      description: "Node memory is filling up (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostMemoryUnderMemoryPressure
    expr: (rate(node_vmstat_pgmajfault[1m]) > 1000) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host memory under memory pressure (instance {{ $labels.instance }})
      description: "The node is under heavy memory pressure. High rate of major page faults\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Node memory is < 20% for 1 week. Consider reducing memory space. (instance {{ $labels.instance }}) [copy]
  # You may want to increase the alert manager 'repeat_interval' for this type of alert to daily or weekly
  - alert: HostMemoryIsUnderutilized
    expr: (100 - (avg_over_time(node_memory_MemAvailable_bytes[30m]) / node_memory_MemTotal_bytes * 100) < 20) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 1w
    labels:
      severity: info
    annotations:
      summary: Host Memory is underutilized (instance {{ $labels.instance }})
      description: "Node memory is < 20% for 1 week. Consider reducing memory space. (instance {{ $labels.instance }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostUnusualNetworkThroughputIn
    expr: (sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual network throughput in (instance {{ $labels.instance }})
      description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostUnusualNetworkThroughputOut
    expr: (sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual network throughput out (instance {{ $labels.instance }})
      description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostUnusualDiskReadRate
    expr: (sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk read rate (instance {{ $labels.instance }})
      description: "Disk is probably reading too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostUnusualDiskWriteRate
    expr: (sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk write rate (instance {{ $labels.instance }})
      description: "Disk is probably writing too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # Please add ignored mountpoints in node_exporter parameters like
  # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
  # Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
  - alert: HostOutOfDiskSpace
    expr: ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of disk space (instance {{ $labels.instance }})
      description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}




    
\n  LABELS = {{ $labels }}"
  # Please add ignored mountpoints in node_exporter parameters like
  # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
  # Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
  - alert: HostDiskWillFillIn24Hours
    expr: ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host disk will fill in 24 hours (instance {{ $labels.instance }})
      description: "Filesystem is predicted to run out of space within the next 24 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostOutOfInodes
    expr: (node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of inodes (instance {{ $labels.instance }})
      description: "Disk is almost running out of available inodes (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostFilesystemDeviceError
    expr: (node_filesystem_device_error == 1) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Host filesystem device error (instance {{ $labels.instance }})
      description: "{{ $labels.instance }}: Device error with the {{ $labels.mountpoint }} filesystem\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostInodesWillFillIn24Hours
    expr: (node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 < 10 and predict_linear(node_filesystem_files_free{fstype!="msdosfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly{fstype!="msdosfs"} == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host inodes will fill in 24 hours (instance {{ $labels.instance }})
      description: "Filesystem is predicted to run out of inodes within the next 24 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostUnusualDiskReadLatency
    expr: (rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk read latency (instance {{ $labels.instance }})
      description: "Disk latency is growing (read operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostUnusualDiskWriteLatency
    expr: (rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk write latency (instance {{ $labels.instance }})
      description: "Disk latency is growing (write operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostHighCpuLoad
    expr: (sum by (instance) (avg by (mode, instance) (rate(node_cpu_seconds_total{mode!="idle"}[2m]))) > 0.8) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Host high CPU load (instance {{ $labels.instance }})
      description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # You may want to increase the alert manager 'repeat_interval' for this type of alert to daily or weekly
  - alert: HostCpuIsUnderutilized
    expr: (100 - (rate(node_cpu_seconds_total{mode="idle"}[30m]) * 100) < 20) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 1w
    labels:
      severity: info
    annotations:
      summary: Host CPU is underutilized (instance {{ $labels.instance }})
      description: "CPU load is < 20% for 1 week. Consider reducing the number of CPUs.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"




    
CPU steal is > 10%. A noisy neighbor is killing VM performances or a spot instance may be out of credit. [copy]
  - alert: HostCpuStealNoisyNeighbor
    expr: (avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host CPU steal noisy neighbor (instance {{ $labels.instance }})
      description: "CPU steal is > 10%. A noisy neighbor is killing VM performances or a spot instance may be out of credit.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostCpuHighIowait
    expr: (avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 10) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host CPU high iowait (instance {{ $labels.instance }})
      description: "CPU iowait > 10%. A high iowait means that you are disk or network bound.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostUnusualDiskIo
    expr: (rate(node_disk_io_time_seconds_total[1m]) > 0.5) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk IO (instance {{ $labels.instance }})
      description: "Time spent in IO is too high on {{ $labels.instance }}. Check storage for issues.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # 10000 context switches is an arbitrary number.
  # The alert threshold depends on the nature of the application.
  # Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58
  - alert: HostContextSwitching
    expr: ((rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 10000) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host context switching (instance {{ $labels.instance }})
      description: "Context switching is growing on the node (> 10000 / CPU / s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostSwapIsFillingUp
    expr: ((1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host swap is filling up (instance {{ $labels.instance }})
      description: "Swap is filling up (>80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostSystemdServiceCrashed
    expr: (node_systemd_unit_state{state="failed"} == 1) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host systemd service crashed (instance {{ $labels.instance }})
      description: "systemd service crashed\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostPhysicalComponentTooHot
    expr: ((node_hwmon_temp_celsius * ignoring(label) group_left(instance, job, node, sensor) node_hwmon_sensor_label{label!="tctl"} > 75)) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host physical component too hot (instance {{ $labels.instance }})
      description: "Physical hardware component too hot\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostNodeOvertemperatureAlarm
    expr: (node_hwmon_temp_crit_alarm_celsius == 1) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Host node overtemperature alarm (instance {{ $labels.instance }})
      description: "Physical node temperature alarm triggered\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostRaidArrayGotInactive
    expr: (node_md_state{state="inactive"} > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Host RAID array got inactive (instance {{ $labels.instance }})
      description: "RAID array {{ $labels.device }} is 




    
in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostRaidDiskFailure
    expr: (node_md_disks{state="failed"} > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host RAID disk failure (instance {{ $labels.instance }})
      description: "At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostKernelVersionDeviations
    expr: (count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 6h
    labels:
      severity: warning
    annotations:
      summary: Host kernel version deviations (instance {{ $labels.instance }})
      description: "Different kernel versions are running\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostOomKillDetected
    expr: (increase(node_vmstat_oom_kill[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host OOM kill detected (instance {{ $labels.instance }})
      description: "OOM kill detected\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostEdacCorrectableErrorsDetected
    expr: (increase(node_edac_correctable_errors_total[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 0m
    labels:
      severity: info
    annotations:
      summary: Host EDAC Correctable Errors detected (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} correctable memory errors reported by EDAC in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostEdacUncorrectableErrorsDetected
    expr: (node_edac_uncorrectable_errors_total > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostNetworkReceiveErrors
    expr: (rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host Network Receive Errors (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last two minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostNetworkTransmitErrors
    expr: (rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host Network Transmit Errors (instance {{ $labels.instance }})
      description: "Host 




    
{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last two minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostNetworkInterfaceSaturated
    expr: ((rate(node_network_receive_bytes_total{device!~"^tap.*|^vnet.*|^veth.*|^tun.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*|^vnet.*|^veth.*|^tun.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*|^vnet.*|^veth.*|^tun.*"} > 0.8 < 10000) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Host Network Interface Saturated (instance {{ $labels.instance }})
      description: "The network interface \"{{ $labels.device }}\" on \"{{ $labels.instance }}\" is getting overloaded.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostNetworkBondDegraded
    expr: ((node_bonding_active - node_bonding_slaves) != 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host Network Bond Degraded (instance {{ $labels.instance }})
      description: "Bond \"{{ $labels.device }}\" degraded on \"{{ $labels.instance }}\".\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostConntrackLimit
    expr: (node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host conntrack limit (instance {{ $labels.instance }})
      description: "The number of conntrack is approaching limit\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostClockSkew
    expr: ((node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Host clock skew (instance {{ $labels.instance }})
      description: "Clock skew detected. Clock is out of sync. Ensure NTP is configured correctly on this host.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostClockNotSynchronising
    expr: (min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host clock not synchronising (instance {{ $labels.instance }})
      description: "Clock not synchronising. Ensure NTP is configured on this host.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HostRequiresReboot
    expr: (node_reboot_required > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 4h
    labels:
      severity: info
    annotations:
      summary: Host requires reboot (instance {{ $labels.instance }})
      description: "{{ $labels.instance }} requires a reboot.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: SmartDeviceTemperatureWarning
    expr: smartctl_device_temperature > 60
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Smart device temperature warning (instance {{ $labels.instance }})
      description: "Device temperature  warning (instance {{ $labels.instance }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: SmartDeviceTemperatureCritical
    expr: smartctl_device_temperature > 80
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Smart device temperature critical (instance {{ $labels.instance }})
      description: "Device temperature critical  (instance {{ $labels.instance }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels




    
 }}"
  - alert: SmartCriticalWarning
    expr: smartctl_device_critical_warning > 0
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: Smart critical warning (instance {{ $labels.instance }})
      description: "device has critical warning (instance {{ $labels.instance }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: SmartMediaErrors
    expr: smartctl_device_media_errors > 0
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: Smart media errors (instance {{ $labels.instance }})
      description: "device has media errors (instance {{ $labels.instance }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: SmartNvmeWearoutIndicator
    expr: smartctl_device_available_spare{device=~"nvme.*"} < smartctl_device_available_spare_threshold{device=~"nvme.*"}
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: Smart NVME Wearout Indicator (instance {{ $labels.instance }})
      description: "NVMe device is wearing out (instance {{ $labels.instance }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # This rule can be very noisy in dynamic infra with legitimate container start/stop/deployment.
  - alert: ContainerKilled
    expr: time() - container_last_seen > 60
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Container killed (instance {{ $labels.instance }})
      description: "A container has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # This rule can be very noisy in dynamic infra with legitimate container start/stop/deployment.
  - alert: ContainerAbsent
    expr: absent(container_last_seen)
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Container absent (instance {{ $labels.instance }})
      description: "A container is absent for 5 min\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ContainerHighCpuUtilization
    expr: (sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (pod, container) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container High CPU utilization (instance {{ $labels.instance }})
      description: "Container CPU utilization is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
  - alert: ContainerHighMemoryUsage
    expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container High Memory usage (instance {{ $labels.instance }})
      description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ContainerVolumeUsage
    expr: (1 - (sum(container_fs_inodes_free{name!=""}) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container Volume usage (instance {{ $labels.instance }})
      description: "Container Volume usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ContainerHighThrottleRate
    expr: sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace) > ( 25 / 100 )
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Container high throttle rate (instance {{ $labels.instance }})
      description: "Container is being throttled\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ContainerLowCpuUtilization
    expr: (sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (pod, container) * 100) < 20
    for: 7d
    labels:
      severity: info
    annotations:
      summary: Container Low CPU utilization (instance {{ $labels.instance }})
      description: "Container CPU utilization is under 20% for 1 week. Consider reducing the allocated CPU.\n  VALUE = 




    
{{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ContainerLowMemoryUsage
    expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) < 20
    for: 7d
    labels:
      severity: info
    annotations:
      summary: Container Low Memory usage (instance {{ $labels.instance }})
      description: "Container Memory usage is under 20% for 1 week. Consider reducing the allocated memory.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: BlackboxProbeFailed
    expr: probe_success == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Blackbox probe failed (instance {{ $labels.instance }})
      description: "Probe failed\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
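These blackbox rules assume probes are scraped through the usual /probe relabelling pattern on the Prometheus side; a minimal scrape-config sketch, where the module name, target URL and exporter address are placeholder assumptions:

# prometheus.yml (fragment) -- example blackbox_exporter probe job.
scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]                  # assumed module defined in blackbox.yml
    static_configs:
      - targets:
          - https://example.com           # placeholder probe target
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target      # pass the target URL as ?target=
      - source_labels: [__param_target]
        target_label: instance            # keep the probed URL as the instance label
      - target_label: __address__
        replacement: 127.0.0.1:9115       # assumed blackbox_exporter host:port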
  - alert: BlackboxConfigurationReloadFailure
    expr: blackbox_exporter_config_last_reload_successful != 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Blackbox configuration reload failure (instance {{ $labels.instance }})
      description: "Blackbox configuration reload failure\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: BlackboxSlowProbe
    expr: avg_over_time(probe_duration_seconds[1m]) > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Blackbox slow probe (instance {{ $labels.instance }})
      description: "Blackbox probe took more than 1s to complete\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: BlackboxProbeHttpFailure
    expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})
      description: "HTTP status code is not 200-399\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: BlackboxSslCertificateWillExpireSoon
    expr: 3 <= round((last_over_time(probe_ssl_earliest_cert_expiry[10m]) - time()) / 86400, 0.1) < 20
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
      description: "SSL certificate expires in less than 20 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: BlackboxSslCertificateWillExpireSoon
    expr: 0 <= round((last_over_time(probe_ssl_earliest_cert_expiry[10m]) - time()) / 86400, 0.1) < 3
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
      description: "SSL certificate expires in less than 3 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # For probe_ssl_earliest_cert_expiry to be exposed after expiration, you
  # need to enable insecure_skip_verify. Note that this will disable
  # certificate validation.
  # See https://github.com/prometheus/blackbox_exporter/blob/master/CONFIGURATION.md#tls_config
  - alert: BlackboxSslCertificateExpired
    expr: round((last_over_time(probe_ssl_earliest_cert_expiry[10m]) - time()) / 86400, 0.1) < 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Blackbox SSL certificate expired (instance {{ $labels.instance }})
      description: "SSL certificate has expired already\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: BlackboxProbeSlowHttp
    expr: avg_over_time(probe_http_duration_seconds[1m]) > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Blackbox probe slow HTTP (instance {{ $labels.instance }})
      description: "HTTP request took more than 1s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: BlackboxProbeSlowPing
    expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Blackbox probe slow ping (instance {{ $labels.instance }})
      description: "Blackbox ping took more than 1s\n  VALUE = {{ $value }}




    
\n  LABELS = {{ $labels }}"
  - alert: WindowsServerCollectorError
    expr: windows_exporter_collector_success == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Windows Server collector Error (instance {{ $labels.instance }})
      description: "Collector {{ $labels.collector }} was not successful\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: WindowsServerServiceStatus
    expr: windows_service_status{status="ok"} != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Windows Server service Status (instance {{ $labels.instance }})
      description: "Windows Service state is not OK\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: WindowsServerCpuUsage
    expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 80
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Windows Server CPU Usage (instance {{ $labels.instance }})
      description: "CPU Usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: WindowsServerMemoryUsage
    expr: 100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100) > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Windows Server memory Usage (instance {{ $labels.instance }})
      description: "Memory usage is more than 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: WindowsServerDiskSpaceUsage
    expr: 100.0 - 100 * ((windows_logical_disk_free_bytes / 1024 / 1024 ) / (windows_logical_disk_size_bytes / 1024 / 1024)) > 80
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Windows Server disk Space Usage (instance {{ $labels.instance }})
      description: "Disk usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: VirtualMachineMemoryWarning
    expr: vmware_vm_mem_usage_average / 100 >= 80 and vmware_vm_mem_usage_average / 100 < 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Virtual Machine Memory Warning (instance {{ $labels.instance }})
      description: "High memory usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: VirtualMachineMemoryCritical
    expr: vmware_vm_mem_usage_average / 100 >= 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Virtual Machine Memory Critical (instance {{ $labels.instance }})
      description: "High memory usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HighNumberOfSnapshots
    expr: vmware_vm_snapshots > 3
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: High Number of Snapshots (instance {{ $labels.instance }})
      description: "High snapshots number on {{ $labels.instance }}: {{ $value }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: OutdatedSnapshots
    expr: (time() - vmware_vm_snapshot_timestamp_seconds) / (60 * 60 * 24) >= 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Outdated Snapshots (instance {{ $labels.instance }})
      description: "Outdated snapshots on {{ $labels.instance }}: {{ $value | printf \"%.0f\"}} days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: NetdataHighCpuUsage
    expr: rate(netdata_cpu_cpu_percentage_average{dimension="idle"}[1m]) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Netdata high cpu usage (instance {{ $labels.instance }})
      description: "Netdata high CPU usage (> 80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
CPU steal is > 10%. A noisy neighbor is killing VM performances or a spot instance may be out of credit. [copy]
  - alert: HostCpuStealNoisyNeighbor
    expr: rate(netdata_cpu_cpu_percentage_average{dimension="steal"}[1m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host CPU steal noisy neighbor (instance {{ $labels.instance }})
      description: "CPU steal is > 10%. A noisy neighbor is killing VM performances or a spot instance may be out of credit.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: NetdataHighMemoryUsage
    expr: 100 / netdata_system_ram_MB_average * netdata_system_ram_MB_average{dimension=~"free|cached"} < 20
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Netdata high memory usage (instance {{ $labels.instance }})
      description: "Netdata high memory usage (> 80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: NetdataLowDiskSpace
    expr: 100 / netdata_disk_space_GB_average * netdata_disk_space_GB_average{dimension=~"avail|cached"} < 20
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Netdata low disk space (instance {{ $labels.instance }})
      description: "Netdata low disk space (> 80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: NetdataPredictedDiskFull
    expr: predict_linear(netdata_disk_space_GB_average{dimension=~"avail|cached"}[3h], 24 * 3600) < 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Netdata predicted disk full (instance {{ $labels.instance }})
      description: "Netdata predicted disk full in 24 hours\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: NetdataMdMismatchCntUnsynchronizedBlocks
    expr: netdata_md_mismatch_cnt_unsynchronized_blocks_average > 1024
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Netdata MD mismatch cnt unsynchronized blocks (instance {{ $labels.instance }})
      description: "RAID Array have unsynchronized blocks\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: NetdataDiskReallocatedSectors
    expr: increase(netdata_smartd_log_reallocated_sectors_count_sectors_average[1m]) > 0
    for: 0m
    labels:
      severity: info
    annotations:
      summary: Netdata disk reallocated sectors (instance {{ $labels.instance }})
      description: "Reallocated sectors on disk\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: NetdataDiskCurrentPendingSector
    expr: netdata_smartd_log_current_pending_sector_count_sectors_average > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Netdata disk current pending sector (instance {{ $labels.instance }})
      description: "Disk current pending sector\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: NetdataReportedUncorrectableDiskSectors
    expr: increase(netdata_smartd_log_offline_uncorrectable_sector_count_sectors_average[2m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Netdata reported uncorrectable disk sectors (instance {{ $labels.instance }})
      description: "Reported uncorrectable disk sectors\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MysqlTooManyConnections(>80%)
    expr: max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MySQL too many connections (> 80%) (instance {{ $labels.instance }})
      description: "More than 80% of MySQL connections are in use on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MysqlHighPreparedStatementsUtilization(>80%)
    expr: max_over_time(mysql_global_status_prepared_stmt_count[1m]) / mysql_global_variables_max_prepared_stmt_count * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MySQL high prepared statements utilization (> 80%) (instance {{ $labels.instance }})
      description: "High utilization of prepared statements (>80%) on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MysqlHighThreadsRunning
    expr: max_over_time(mysql_global_status_threads_running[1m]) / mysql_global_variables_max_connections * 100 > 60
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MySQL high threads running (instance {{ $labels.instance }})
      description: "More than 60% of MySQL connections are in running state on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MysqlSlaveIoThreadNotRunning
    expr: ( mysql_slave_status_slave_io_running and ON (instance) mysql_slave_status_master_server_id > 0 ) == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: MySQL Slave IO thread not running (instance {{ $labels.instance }})
      description: "MySQL Slave IO thread not running on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MysqlSlaveSqlThreadNotRunning
    expr: ( mysql_slave_status_slave_sql_running and ON (instance) mysql_slave_status_master_server_id > 0) == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: MySQL Slave SQL thread not running (instance {{ $labels.instance }})
      description: "MySQL Slave SQL thread not running on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MysqlSlaveReplicationLag
    expr: ( (mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay) and ON (instance) mysql_slave_status_master_server_id > 0 ) > 30
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: MySQL Slave replication lag (instance {{ $labels.instance }})
      description: "MySQL replication lag on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MysqlSlowQueries
    expr: increase(mysql_global_status_slow_queries[1m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MySQL slow queries (instance {{ $labels.instance }})
      description: "MySQL server mysql has some new slow query.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MysqlInnodbLogWaits
    expr: rate(mysql_global_status_innodb_log_waits[15m]) > 10
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: MySQL InnoDB log waits (instance {{ $labels.instance }})
      description: "MySQL innodb log writes stalling\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MysqlRestarted
    expr: mysql_global_status_uptime < 60
    for: 0m
    labels:
      severity: info
    annotations:
      summary: MySQL restarted (instance {{ $labels.instance }})
      description: "MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PostgresqlDown
    expr: pg_up == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Postgresql down (instance {{ $labels.instance }})
      description: "Postgresql instance is down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PostgresqlRestarted
    expr: time() - pg_postmaster_start_time_seconds < 60
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Postgresql restarted (instance {{ $labels.instance }})
      description: "Postgresql restarted\n  VALUE = {{ $value }}\n  LABELS 




    
= {{ $labels }}"
  - alert: PostgresqlExporterError
    expr: pg_exporter_last_scrape_error > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Postgresql exporter error (instance {{ $labels.instance }})
      description: "Postgresql exporter is showing errors. A query may be buggy in query.yaml\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PostgresqlTableNotAutoVacuumed
    expr: (pg_stat_user_tables_last_autovacuum > 0) and (time() - pg_stat_user_tables_last_autovacuum) > 60 * 60 * 24 * 10
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Postgresql table not auto vacuumed (instance {{ $labels.instance }})
      description: "Table {{ $labels.relname }} has not been auto vacuumed for 10 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PostgresqlTableNotAutoAnalyzed
    expr: (pg_stat_user_tables_last_autoanalyze > 0) and (time() - pg_stat_user_tables_last_autoanalyze) > 24 * 60 * 60 * 10
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Postgresql table not auto analyzed (instance {{ $labels.instance }})
      description: "Table {{ $labels.relname }} has not been auto analyzed for 10 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PostgresqlTooManyConnections
    expr: sum by (instance, job, server) (pg_stat_activity_count) > min by (instance, job, server) (pg_settings_max_connections * 0.8)
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Postgresql too many connections (instance {{ $labels.instance }})
      description: "PostgreSQL instance has too many connections (> 80%).\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PostgresqlNotEnoughConnections
    expr: sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) < 5
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Postgresql not enough connections (instance {{ $labels.instance }})
      description: "PostgreSQL instance should have more connections (> 5)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PostgresqlDeadLocks
    expr: increase(pg_stat_database_deadlocks{datname!~"template.*|postgres"}[1m]) > 5
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Postgresql dead locks (instance {{ $labels.instance }})
      description: "PostgreSQL has dead-locks\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PostgresqlHighRollbackRate
    expr: sum by (namespace,datname) ((rate(pg_stat_database_xact_rollback{datname!~"template.*|postgres",datid!="0"}[3m])) / ((rate(pg_stat_database_xact_rollback{datname!~"template.*|postgres",datid!="0"}[3m])) + (rate(pg_stat_database_xact_commit{datname!~"template.*|postgres",datid!="0"}[3m])))) > 0.02
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Postgresql high rollback rate (instance {{ $labels.instance }})
      description: "Ratio of transactions being aborted compared to committed is > 2 %\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PostgresqlCommitRateLow
    expr: rate(pg_stat_database_xact_commit[1m]) < 10
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Postgresql commit rate low (instance {{ $labels.instance }})
      description: "Postgresql seems to be processing very few transactions\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PostgresqlLowXidConsumption
    expr: rate(pg_txid_current[1m]) < 5
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Postgresql low XID consumption (instance {{ $labels.instance }})
      description: "Postgresql seems to be consuming transaction IDs very slowly\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PostgresqlHighRateStatementTimeout
    expr: rate(postgresql_errors_total{type="statement_timeout"}[1m]) > 3
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Postgresql high rate statement timeout (instance {{ $labels.instance }})
      description: "Postgres transactions showing high rate of statement timeouts\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PostgresqlHighRateDeadlock
    expr: increase(postgresql_errors_total{type="deadlock_detected"}[1m]) > 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Postgresql high rate deadlock (instance {{ $labels.instance }})
      description: "Postgres detected deadlocks\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PostgresqlUnusedReplicationSlot
    expr: pg_replication_slots_active == 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Postgresql unused replication slot (instance {{ $labels.instance }})
      description: "Unused Replication Slots\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PostgresqlTooManyDeadTuples
    expr: ((pg_stat_user_tables_n_dead_tup > 10000) / (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup)) >= 0.1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Postgresql too many dead tuples (instance {{ $labels.instance }})
      description: "PostgreSQL dead tuples is too large\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PostgresqlConfigurationChanged
    expr: {__name__=~"pg_settings_.*"} != ON(__name__) {__name__=~"pg_settings_([^t]|t[^r]|tr[^a]|tra[^n]|tran[^s]|trans[^a]|transa[^c]|transac[^t]|transact[^i]|transacti[^o]|transactio[^n]|transaction[^_]|transaction_[^r]|transaction_r[^e]|transaction_re[^a]|transaction_rea[^d]|transaction_read[^_]|transaction_read_[^o]|transaction_read_o[^n]|transaction_read_on[^l]|transaction_read_onl[^y]).*"} OFFSET 5m
    for: 0m
    labels:
      severity: info
    annotations:
      summary: Postgresql configuration changed (instance {{ $labels.instance }})
      description: "Postgres Database configuration change has occurred\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PostgresqlSslCompressionActive
    expr: sum(pg_stat_ssl_compression) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Postgresql SSL compression active (instance {{ $labels.instance }})
      description: "Database connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PostgresqlTooManyLocksAcquired
    expr: ((sum (pg_locks_count)) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)) > 0.20
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Postgresql too many locks acquired (instance {{ $labels.instance }})
      description: "Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737
  - alert: PostgresqlBloatIndexHigh(>80%)
    expr: pg_bloat_btree_bloat_pct > 80 and on (idxname) (pg_bloat_btree_real_size > 100000000)
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: Postgresql bloat index high (> 80%) (instance {{ $labels.instance }})
      description: "The index {{ $labels.idxname }} is bloated. You should execute `REINDEX INDEX CONCURRENTLY {{ $labels.idxname }};`\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737
  - alert: PostgresqlBloatTableHigh(>80%)
    expr: pg_bloat_table_bloat_pct > 80 and on (relname) (pg_bloat_table_real_size > 200000000)
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: Postgresql bloat table high (> 80%) (instance {{ $labels.instance }})
      description: "The table {{ $labels.relname }} is bloated. You should execute `VACUUM {{ $labels.relname }};`\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PostgresqlInvalidIndex
    expr: pg_genaral_index_info_pg_relation_size{indexrelname=~".*ccnew.*"}
    for: 6h
    labels:
      severity: warning
    annotations:
      summary: Postgresql invalid index (instance {{ $labels.instance }})
      description: "The table {{ $labels.relname }} has an invalid index: {{ $labels.indexrelname }}. You should execute `DROP INDEX {{ $labels.indexrelname }};`\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # Not expanded: SQL Server down (severity: critical). SQL Server instance is down.
  # Not expanded: SQL Server deadlock (severity: warning). SQL Server is having some deadlocks.
  - alert: PatroniHasNoLeader
    expr: (max by (scope) (patroni_master) < 1) and (max by (scope) (patroni_standby_leader) < 1)
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Patroni has no Leader (instance {{ $labels.instance }})
      description: "A leader node (neither primary nor standby) cannot be found inside the cluster {{ $labels.scope }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PgbouncerActiveConnections
    expr: pgbouncer_pools_server_active_connections > 200
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: PGBouncer active connections (instance {{ $labels.instance }})
      description: "PGBouncer pools are filling up\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PgbouncerErrors
    expr: increase(pgbouncer_errors_count{errmsg!="server conn crashed?"}[1m]) > 10
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: PGBouncer errors (instance {{ $labels.instance }})
      description: "PGBouncer is logging errors. This may be due to a a server restart or an admin typing commands at the pgbouncer console.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PgbouncerMaxConnections
    expr: increase(pgbouncer_errors_count{errmsg="no more connections allowed (max_client_conn)"}[30s]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: PGBouncer max connections (instance {{ $labels.instance }})
      description: "The number of PGBouncer client connections has reached max_client_conn.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: RedisMissingMaster
    expr: (count(redis_instance_info{role="master"}) or vector(0)) < 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Redis missing master (instance {{ $labels.instance }})
      description: "Redis cluster has no node marked as master.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: RedisTooManyMasters
    expr: count(redis_instance_info{role="master"}) > 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Redis too many masters (instance {{ $labels.instance }})
      description: "Redis cluster has too many nodes marked as master.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: RedisDisconnectedSlaves
    expr: count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Redis disconnected slaves (instance {{ $labels.instance }})
      description: "Redis not replicating for all slaves. Consider reviewing the redis replication status.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # Not expanded: Redis replication broken (severity: critical). Redis instance lost a slave.
  # Not expanded: Redis cluster flapping (severity: critical). Changes have been detected in Redis replica connection. This can occur when replica nodes lose connection to the master and reconnect (a.k.a. flapping).
  - alert: RedisMissingBackup
    expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Redis missing backup (instance {{ $labels.instance }})
      description: "Redis has not been backuped for 24 hours\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # The exporter must be started with --include-system-metrics flag or REDIS_EXPORTER_INCL_SYSTEM_METRICS=true environment variable.
  - alert: RedisOutOfSystemMemory
    expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Redis out of system memory (instance {{ $labels.instance }})
      description: "Redis is running out of system memory (> 90%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: RedisOutOfConfiguredMaxmemory
    expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 90 and on(instance) redis_memory_max_bytes > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Redis out of configured maxmemory (instance {{ $labels.instance }})
      description: "Redis is running out of configured maxmemory (> 90%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: RedisTooManyConnections
    expr: redis_connected_clients / redis_config_maxclients * 100 > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Redis too many connections (instance {{ $labels.instance }})
      description: "Redis is running out of connections (> 90% used)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # Not expanded: Redis not enough connections (severity: warning). Redis instance should have more connections (> 5).
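The "Redis not enough connections" rule above is listed without its expression; a minimal sketch, assuming it compares redis_connected_clients (used by the rule above this one) against the 5-connection floor from its description, with an assumed 2m hold:
  - alert: RedisNotEnoughConnections
    expr: redis_connected_clients < 5
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Redis not enough connections (instance {{ $labels.instance }})
      description: "Redis instance should have more connections (> 5)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"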
  - alert: RedisRejectedConnections
    expr: increase(redis_rejected_connections_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Redis rejected connections (instance {{ $labels.instance }})
      description: "Some connections to Redis has been rejected\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # Not expanded: MongoDB Down (severity: critical). MongoDB instance is down.
  # Not expanded: MongoDB replica member unhealthy (severity: critical). MongoDB replica member is not healthy.
  - alert: MongodbReplicationLag
    expr: (mongodb_rs_members_optimeDate{member_state="PRIMARY"} - on (set) group_right mongodb_rs_members_optimeDate{member_state="SECONDARY"}) / 1000 > 10
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: MongoDB replication lag (instance {{ $labels.instance }})
      description: "Mongodb replication lag is more than 10s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MongodbReplicationHeadroom
    expr: sum(avg(mongodb_mongod_replset_oplog_head_timestamp - mongodb_mongod_replset_oplog_tail_timestamp)) - sum(avg(mongodb_rs_members_optimeDate{member_state="PRIMARY"} - on (set) group_right mongodb_rs_members_optimeDate{member_state="SECONDARY"})) <= 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: MongoDB replication headroom (instance {{ $labels.instance }})
      description: "MongoDB replication headroom is <= 0\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MongodbNumberCursorsOpen
    expr: mongodb_ss_metrics_cursor_open{csr_type="total"} > 10 * 1000
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MongoDB number cursors open (instance {{ $labels.instance }})
      description: "Too many cursors opened by MongoDB for clients (> 10k)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MongodbCursorsTimeouts
    expr: increase(mongodb_ss_metrics_cursor_timedOut[1m]) > 100
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MongoDB cursors timeouts (instance {{ $labels.instance }})
      description: "Too many cursors are timing out\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MongodbTooManyConnections
    expr: avg by(instance) (rate(mongodb_ss_connections{conn_type="current"}[1m])) / avg by(instance) (sum (mongodb_ss_connections) by (instance)) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MongoDB too many connections (instance {{ $labels.instance }})
      description: "Too many connections (> 80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MongodbReplicationLag
    expr: avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}) > 10
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: MongoDB replication lag (instance {{ $labels.instance }})
      description: "Mongodb replication lag is more than 10s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # Not expanded: MongoDB replication Status 3 (severity: critical). Replica set member either performs startup self-checks or transitions from completing a rollback or resync.
  # Not expanded: MongoDB replication Status 6 (severity: critical). Replica set member, as seen from another member of the set, is not yet known.
  # Not expanded: MongoDB replication Status 8 (severity: critical). Replica set member, as seen from another member of the set, is unreachable.
  # Not expanded: MongoDB replication Status 9 (severity: critical). Replica set member is actively performing a rollback; data is not available for reads.
  - alert: MongodbReplicationStatus10
    expr: mongodb_replset_member_state == 10
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: MongoDB replication Status 10 (instance {{ $labels.instance }})
      description: "MongoDB Replication set member was once in a replica set but was subsequently removed\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MongodbNumberCursorsOpen
    expr: mongodb_metrics_cursor_open{state="total_open"} > 10000
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MongoDB number cursors open (instance {{ $labels.instance }})
      description: "Too many cursors opened by MongoDB for clients (> 10k)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MongodbCursorsTimeouts
    expr: increase(mongodb_metrics_cursor_timed_out_total[1m]) > 100
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MongoDB cursors timeouts (instance {{ $labels.instance }})
      description: "Too many cursors are timing out\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MongodbTooManyConnections
    expr: avg by(instance) (rate(mongodb_connections{state="current"}[1m])) / avg by(instance) (sum (mongodb_connections) by (instance)) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MongoDB too many connections (instance {{ $labels.instance }})
      description: "Too many connections (> 80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MongodbVirtualMemoryUsage
    expr: (sum(mongodb_memory{type="virtual"}) BY (instance) / sum(mongodb_memory{type="mapped"}) BY (instance)) > 3
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MongoDB virtual memory usage (instance {{ $labels.instance }})
      description: "High memory usage\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: MgobBackupFailed
    expr: changes(mgob_scheduler_backup_total{status="500"}[1h]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Mgob backup failed (instance {{ $labels.instance }})
      description: "MongoDB backup has failed\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # Not expanded: RabbitMQ node down (severity: critical). Less than 3 nodes running in RabbitMQ cluster.
  # Not expanded: RabbitMQ node not distributed (severity: critical). Distribution link state is not 'up'.
  - alert: RabbitmqInstancesDifferentVersions
    expr: count(count(rabbitmq_build_info) by (rabbitmq_version)) > 1
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: RabbitMQ instances different versions (instance {{ $labels.instance }})
      description: "Running different version of RabbitMQ in the same cluster, can lead to failure.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: RabbitmqMemoryHigh
    expr: rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes * 100 > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: RabbitMQ memory high (instance {{ $labels.instance }})
      description: "A node use more than 90% of allocated RAM\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: RabbitmqFileDescriptorsUsage
    expr: rabbitmq_process_open_fds / rabbitmq_process_max_fds * 100 > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: RabbitMQ file descriptors usage (instance {{ $labels.instance }})
      description: "A node use more than 90% of file descriptors\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: RabbitmqTooManyUnackMessages
    expr: sum(rabbitmq_queue_messages_unacked) BY (queue) > 1000
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: RabbitMQ too many unack messages (instance {{ $labels.instance }})
      description: "Too many unacknowledged messages\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # Not expanded: RabbitMQ too many connections (severity: warning). The total number of connections on a node is too high.
  # Not expanded: RabbitMQ no queue consumer (severity: warning). A queue has less than 1 consumer.
  - alert: RabbitmqUnroutableMessages
    expr: increase(rabbitmq_channel_messages_unroutable_returned_total[1m]) > 0 or increase(rabbitmq_channel_messages_unroutable_dropped_total[1m]) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: RabbitMQ unroutable messages (instance {{ $labels.instance }})
      description: "A queue has unroutable messages\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # Not expanded: RabbitMQ down (severity: critical). RabbitMQ node down.
  # Not expanded: RabbitMQ cluster down (severity: critical). Less than 3 nodes running in RabbitMQ cluster.
  # Not expanded: RabbitMQ cluster partition (severity: critical). Cluster partition.
  - alert: RabbitmqOutOfMemory
    expr: rabbitmq_node_mem_used / rabbitmq_node_mem_limit * 100 > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: RabbitMQ out of memory (instance {{ $labels.instance }})
      description: "Memory available for RabbmitMQ is low (< 10%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # Not expanded: RabbitMQ too many connections (severity: warning). RabbitMQ instance has too many connections (> 1000).
  # Indicate the queue name in dedicated label.
  - alert: RabbitmqDeadLetterQueueFillingUp
    expr: rabbitmq_queue_messages{queue="my-dead-letter-queue"} > 10
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: RabbitMQ dead letter queue filling up (instance {{ $labels.instance }})
      description: "Dead letter queue is filling up (> 10 msgs)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # Indicate the queue name in dedicated label.
  - alert: RabbitmqTooManyMessagesInQueue
    expr: rabbitmq_queue_messages_ready{queue="my-queue"} > 1000
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: RabbitMQ too many messages in queue (instance {{ $labels.instance }})
      description: "Queue is filling up (> 1000 msgs)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # Indicate the queue name in dedicated label.
  - alert: RabbitmqSlowQueueConsuming
    expr: time() - rabbitmq_queue_head_message_timestamp{queue="my-queue"} > 60
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: RabbitMQ slow queue consuming (instance {{ $labels.instance }})
      description: "Queue messages are consumed slowly (> 60s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # Not expanded: RabbitMQ no consumer (severity: critical). Queue has no consumer.
  # Indicate the queue name in dedicated label.
  - alert: RabbitmqTooManyConsumers
    expr: rabbitmq_queue_consumers{queue="my-queue"} > 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: RabbitMQ too many consumers (instance {{ $labels.instance }})
      description: "Queue should have only 1 consumer\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # Indicate the exchange name in dedicated label.
  - alert: RabbitmqUnactiveExchange
    expr: rate(rabbitmq_exchange_messages_published_in_total{exchange="my-exchange"}[1m]) < 5
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: RabbitMQ unactive exchange (instance {{ $labels.instance }})
      description: "Exchange receive less than 5 msgs per second\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchHeapUsageTooHigh
    expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Elasticsearch Heap Usage Too High (instance {{ $labels.instance }})
      description: "The heap usage is over 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchHeapUsageWarning
    expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Elasticsearch Heap Usage warning (instance {{ $labels.instance }})
      description: "The heap usage is over 80%\n  VALUE




    
 = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchDiskOutOfSpace
    expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 10
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Elasticsearch disk out of space (instance {{ $labels.instance }})
      description: "The disk usage is over 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchDiskSpaceLow
    expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 20
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Elasticsearch disk space low (instance {{ $labels.instance }})
      description: "The disk usage is over 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchClusterRed
    expr: elasticsearch_cluster_health_status{color="red"} == 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Elasticsearch Cluster Red (instance {{ $labels.instance }})
      description: "Elastic Cluster Red status\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchClusterYellow
    expr: elasticsearch_cluster_health_status{color="yellow"} == 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Elasticsearch Cluster Yellow (instance {{ $labels.instance }})
      description: "Elastic Cluster Yellow status\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchHealthyNodes
    expr: elasticsearch_cluster_health_number_of_nodes < 3
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Elasticsearch Healthy Nodes (instance {{ $labels.instance }})
      description: "Missing node in Elasticsearch cluster\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchHealthyDataNodes
    expr: elasticsearch_cluster_health_number_of_data_nodes < 3
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Elasticsearch Healthy Data Nodes (instance {{ $labels.instance }})
      description: "Missing data node in Elasticsearch cluster\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchRelocatingShards
    expr: elasticsearch_cluster_health_relocating_shards > 0
    for: 0m
    labels:
      severity: info
    annotations:
      summary: Elasticsearch relocating shards (instance {{ $labels.instance }})
      description: "Elasticsearch is relocating shards\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchRelocatingShardsTooLong
    expr: elasticsearch_cluster_health_relocating_shards > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Elasticsearch relocating shards too long (instance {{ $labels.instance }})
      description: "Elasticsearch has been relocating shards for 15min\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchInitializingShards
    expr: elasticsearch_cluster_health_initializing_shards > 0
    for: 0m
    labels:
      severity: info
    annotations:
      summary: Elasticsearch initializing shards (instance {{ $labels.instance }})
      description: "Elasticsearch is initializing shards\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchInitializingShardsTooLong
    expr: elasticsearch_cluster_health_initializing_shards > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Elasticsearch initializing shards too long (instance {{ $labels.instance }})
      description: "Elasticsearch has been initializing shards for 15 min\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchUnassignedShards
    expr: elasticsearch_cluster_health_unassigned_shards > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Elasticsearch unassigned shards (instance {{ $labels.instance }})
      description: "Elasticsearch has unassigned shards\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchPendingTasks
    expr: elasticsearch_cluster_health_number_of_pending_tasks > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Elasticsearch pending tasks (instance {{ $labels.instance }})
      description: "Elasticsearch has pending tasks. Cluster works slowly.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchNoNewDocuments
    expr: increase(elasticsearch_indices_indexing_index_total{es_data_node="true"}[10m]) < 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Elasticsearch no new documents (instance {{ $labels.instance }})
      description: "No new documents for 10 min!\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchHighIndexingLatency
    expr: elasticsearch_indices_indexing_index_time_seconds_total / elasticsearch_indices_indexing_index_total > 0.0005
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Elasticsearch High Indexing Latency (instance {{ $labels.instance }})
      description: "The indexing latency on Elasticsearch cluster is higher than the threshold.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchHighIndexingRate
    expr: sum(rate(elasticsearch_indices_indexing_index_total[1m]))> 10000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Elasticsearch High Indexing Rate (instance {{ $labels.instance }})
      description: "The indexing rate on Elasticsearch cluster is higher than the threshold.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchHighQueryRate
    expr: sum(rate(elasticsearch_indices_search_query_total[1m])) > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Elasticsearch High Query Rate (instance {{ $labels.instance }})
      description: "The query rate on Elasticsearch cluster is higher than the threshold.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ElasticsearchHighQueryLatency
    expr: elasticsearch_indices_search_fetch_time_seconds / elasticsearch_indices_search_fetch_total > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Elasticsearch High Query Latency (instance {{ $labels.instance }})
      description: "The query latency on Elasticsearch cluster is higher than the threshold.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraNodeIsUnavailable
    expr: sum(cassandra_endpoint_active) by (cassandra_cluster,instance,exported_endpoint) < 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra Node is unavailable (instance {{ $labels.instance }})
      description: "Cassandra Node is unavailable - {{ $labels.cassandra_cluster }} {{ $labels.exported_endpoint }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraManyCompactionTasksArePending
    expr: cassandra_table_estimated_pending_compactions > 100
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Cassandra many compaction tasks are pending (instance {{ $labels.instance }})
      description: "Many Cassandra compaction tasks are pending - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraCommitlogPendingTasks
    expr: cassandra_commit_log_pending_tasks > 15
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra commitlog pending tasks (instance {{ $labels.instance }})
      description: "Cassandra 




    
commitlog pending tasks - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraCompactionExecutorBlockedTasks
    expr: cassandra_thread_pool_blocked_tasks{pool="CompactionExecutor"} > 15
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra compaction executor blocked tasks (instance {{ $labels.instance }})
      description: "Some Cassandra compaction executor tasks are blocked - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraFlushWriterBlockedTasks
    expr: cassandra_thread_pool_blocked_tasks{pool="MemtableFlushWriter"} > 15
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra flush writer blocked tasks (instance {{ $labels.instance }})
      description: "Some Cassandra flush writer tasks are blocked - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraConnectionTimeoutsTotal
    expr: avg(cassandra_client_request_timeouts_total) by (cassandra_cluster,instance) > 5
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Cassandra connection timeouts total (instance {{ $labels.instance }})
      description: "Some connection between nodes are ending in timeout - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraStorageExceptions
    expr: changes(cassandra_storage_exceptions_total[1m]) > 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra storage exceptions (instance {{ $labels.instance }})
      description: "Something is going wrong with cassandra storage - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraTombstoneDump
    expr: avg(cassandra_table_tombstones_scanned{quantile="0.99"}) by (instance,cassandra_cluster,keyspace) > 100
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Cassandra tombstone dump (instance {{ $labels.instance }})
      description: "Cassandra tombstone dump - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraClientRequestUnavailableWrite
    expr: changes(cassandra_client_request_unavailable_exceptions_total{operation="write"}[1m]) > 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Cassandra client request unavailable write (instance {{ $labels.instance }})
      description: "Some Cassandra client requests are unavailable to write - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraClientRequestUnavailableRead
    expr: changes(cassandra_client_request_unavailable_exceptions_total{operation="read"}[1m]) > 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Cassandra client request unavailable read (instance {{ $labels.instance }})
      description: "Some Cassandra client requests are unavailable to read - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraClientRequestWriteFailure
    expr: increase(cassandra_client_request_failures_total{operation="write"}[1m]) > 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Cassandra client request write failure (instance {{ $labels.instance }})
      description: "Read failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraClientRequestReadFailure
    expr: increase(cassandra_client_request_failures_total{operation="read"}[1m]) > 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Cassandra client request read failure (instance {{ $labels.instance }})
      description: "Read failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraHintsCount
    expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:totalhints:count"}[1m]) > 3
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra hints count (instance {{ $labels.instance }})
      description: "Cassandra hints count has changed on {{ $labels.instance }} some nodes may go down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraCompactionTaskPending
    expr: avg_over_time(cassandra_stats{name="org:apache:cassandra:metrics:compaction:pendingtasks:value"}[1m]) > 100
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra compaction task pending (instance {{ $labels.instance }})
      description: "Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraViewwriteLatency
    expr: cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:viewwrite:viewwritelatency:99thpercentile",service="cas"} > 100000
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra viewwrite latency (instance {{ $labels.instance }})
      description: "High viewwrite latency on {{ $labels.instance }} cassandra node\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraBadHacker
    expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:client:authfailure:count"}[1m]) > 5
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra bad hacker (instance {{ $labels.instance }})
      description: "Increase of Cassandra authentication failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraNodeDown
    expr: sum(cassandra_stats{name="org:apache:cassandra:net:failuredetector:downendpointcount"}) by (service,group,cluster,env) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra node down (instance {{ $labels.instance }})
      description: "Cassandra node down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraCommitlogPendingTasks
    expr: cassandra_stats{name="org:apache:cassandra:metrics:commitlog:pendingtasks:value"} > 15
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra commitlog pending tasks (instance {{ $labels.instance }})
      description: "Unexpected number of Cassandra commitlog pending tasks\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraCompactionExecutorBlockedTasks
    expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:compactionexecutor:currentlyblockedtasks:count"} > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra compaction executor blocked tasks (instance {{ $labels.instance }})
      description: "Some Cassandra compaction executor tasks are blocked\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraFlushWriterBlockedTasks
    expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:memtableflushwriter:currentlyblockedtasks:count"} > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra flush writer blocked tasks (instance {{ $labels.instance }})
      description: "Some Cassandra flush writer tasks are blocked\n  VALUE = {{ $value }}\n  LABELS = {{ $labels




    
 }}"
  - alert: CassandraRepairPendingTasks
    expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:pendingtasks:value"} > 2
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra repair pending tasks (instance {{ $labels.instance }})
      description: "Some Cassandra repair tasks are pending\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraRepairBlockedTasks
    expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:currentlyblockedtasks:count"} > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra repair blocked tasks (instance {{ $labels.instance }})
      description: "Some Cassandra repair tasks are blocked\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraConnectionTimeoutsTotal
    expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:connection:totaltimeouts:count"}[1m]) > 5
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Cassandra connection timeouts total (instance {{ $labels.instance }})
      description: "Some connection between nodes are ending in timeout\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraStorageExceptions
    expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:exceptions:count"}[1m]) > 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra storage exceptions (instance {{ $labels.instance }})
      description: "Something is going wrong with cassandra storage\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraTombstoneDump
    expr: cassandra_stats{name="org:apache:cassandra:metrics:table:tombstonescannedhistogram:99thpercentile"} > 1000
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra tombstone dump (instance {{ $labels.instance }})
      description: "Too much tombstones scanned in queries\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraClientRequestUnavailableWrite
    expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:unavailables:count"}[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra client request unavailable write (instance {{ $labels.instance }})
      description: "Write failures have occurred because too many nodes are unavailable\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraClientRequestUnavailableRead
    expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:unavailables:count"}[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra client request unavailable read (instance {{ $labels.instance }})
      description: "Read failures have occurred because too many nodes are unavailable\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraClientRequestWriteFailure
    expr: increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:failures:oneminuterate"}[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra client request write failure (instance {{ $labels.instance }})
      description: "A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find to the root cause. The most common cause for this type of error is when batch sizes are too large.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraClientRequestReadFailure
    expr: increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:failures:oneminuterate"}[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra client request read failure (instance {{ $labels.instance }})
      description: "A lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find to the root cause. The most common cause for this type of error is when batch sizes are too large.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CassandraCacheHitRateKeyCache
    expr: cassandra_stats{name="org:apache:cassandra:metrics:cache:keycache:hitrate:value"} < .85
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Cassandra cache hit rate key cache (instance {{ $labels.instance }})
      description: "Key cache hit rate is below 85%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ClickhouseMemoryUsageCritical
    expr: ClickHouseAsyncMetrics_CGroupMemoryUsed / ClickHouseAsyncMetrics_CGroupMemoryTotal * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: ClickHouse Memory Usage Critical (instance {{ $labels.instance }})
      description: "Memory usage is critically high, over 90%.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ClickhouseMemoryUsageWarning
    expr: ClickHouseAsyncMetrics_CGroupMemoryUsed / ClickHouseAsyncMetrics_CGroupMemoryTotal * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: ClickHouse Memory Usage Warning (instance {{ $labels.instance }})
      description: "Memory usage is over 80%.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ClickhouseDiskSpaceLowOnDefault
    expr: ClickHouseAsyncMetrics_DiskAvailable_default / (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) * 100 < 20
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: ClickHouse Disk Space Low on Default (instance {{ $labels.instance }})
      description: "Disk space on default is below 20%.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ClickhouseDiskSpaceCriticalOnDefault
    expr: ClickHouseAsyncMetrics_DiskAvailable_default / (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) * 100 < 10
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: ClickHouse Disk Space Critical on Default (instance {{ $labels.instance }})
      description: "Disk space on default disk is critically low, below 10%.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ClickhouseDiskSpaceLowOnBackups
    expr: ClickHouseAsyncMetrics_DiskAvailable_backups / (ClickHouseAsyncMetrics_DiskAvailable_backups + ClickHouseAsyncMetrics_DiskUsed_backups) * 100 < 20
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: ClickHouse Disk Space Low on Backups (instance {{ $labels.instance }})
      description: "Disk space on backups is below 20%.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ClickhouseReplicaErrors
    expr: ClickHouseErrorMetric_ALL_REPLICAS_ARE_STALE == 1 or ClickHouseErrorMetric_ALL_REPLICAS_LOST == 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: ClickHouse Replica Errors (instance {{ $labels.instance }})
      description: "Critical replica errors detected, either all replicas are stale or lost.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ClickhouseNoAvailableReplicas
    expr: ClickHouseErrorMetric_NO_AVAILABLE_REPLICA == 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: ClickHouse No Available Replicas (instance {{ $labels.instance }})
      description: "No available replicas in ClickHouse.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels




    
 }}"
  - alert: ClickhouseNoLiveReplicas
    expr: ClickHouseErrorMetric_TOO_FEW_LIVE_REPLICAS == 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: ClickHouse No Live Replicas (instance {{ $labels.instance }})
      description: "There are too few live replicas available, risking data loss and service disruption.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # Please replace the threshold with an appropriate value
  - alert: ClickhouseHighNetworkTraffic
    expr: ClickHouseMetrics_NetworkSend > 250 or ClickHouseMetrics_NetworkReceive > 250
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: ClickHouse High Network Traffic (instance {{ $labels.instance }})
      description: "Network traffic is unusually high, may affect cluster performance.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # Please replace the threshold with an appropriate value
  - alert: ClickhouseHighTcpConnections
    expr: ClickHouseMetrics_TCPConnection > 400
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: ClickHouse High TCP Connections (instance {{ $labels.instance }})
      description: "High number of TCP connections, indicating heavy client or inter-cluster communication.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ClickhouseInterserverConnectionIssues
    expr: increase(ClickHouseMetrics_InterserverConnection[5m]) > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: ClickHouse Interserver Connection Issues (instance {{ $labels.instance }})
      description: "An increase in interserver connections may indicate replication or distributed query handling issues.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ClickhouseZookeeperConnectionIssues
    expr: avg(ClickHouseMetrics_ZooKeeperSession) != 1
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: ClickHouse ZooKeeper Connection Issues (instance {{ $labels.instance }})
      description: "ClickHouse is experiencing issues with ZooKeeper connections, which may affect cluster state and coordination.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ClickhouseAuthenticationFailures
    expr: increase(ClickHouseErrorMetric_AUTHENTICATION_FAILED[5m]) > 0
    for: 0m
    labels:
      severity: info
    annotations:
      summary: ClickHouse Authentication Failures (instance {{ $labels.instance }})
      description: "Authentication failures detected, indicating potential security issues or misconfiguration.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ClickhouseAccessDeniedErrors
    expr: increase(ClickHouseErrorMetric_RESOURCE_ACCESS_DENIED[5m]) > 0
    for: 0m
    labels:
      severity: info
    annotations:
      summary: ClickHouse Access Denied Errors (instance {{ $labels.instance }})
      description: "Access denied errors have been logged, which could indicate permission issues or unauthorized access attempts.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # Not expanded: Zookeeper Down (severity: critical). Zookeeper down on instance {{ $labels.instance }}.
  # Not expanded: Zookeeper missing leader (severity: critical). Zookeeper cluster has no node marked as leader.
  # Not expanded: Zookeeper Too Many Leaders (severity: critical). Zookeeper cluster has too many nodes marked as leader.
  # Not expanded: Zookeeper Not Ok (severity: warning). Zookeeper instance is not ok.
  - alert: KafkaTopicsReplicas
    expr: sum(kafka_topic_partition_in_sync_replica) by (topic) < 3
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Kafka topics replicas (instance {{ $labels.instance }})
      description: "Kafka topic in-sync partition\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KafkaConsumersGroup
    expr: sum(kafka_consumergroup_lag) by (consumergroup) > 50
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Kafka consumers group (instance {{ $labels.instance }})
      description: "Kafka consumers group\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KafkaTopicOffsetDecreased
    expr: delta(kafka_burrow_partition_current_offset[1m]) < 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Kafka topic offset decreased (instance {{ $labels.instance }})
      description: "Kafka topic offset has decreased\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KafkaConsumerLag
    expr: kafka_burrow_topic_partition_offset - on(partition, cluster, topic) group_right() kafka_burrow_partition_current_offset >= (kafka_burrow_topic_partition_offset offset 15m - on(partition, cluster, topic) group_right() kafka_burrow_partition_current_offset offset 15m) AND kafka_burrow_topic_partition_offset - on(partition, cluster, topic) group_right() kafka_burrow_partition_current_offset > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Kafka consumer lag (instance {{ $labels.instance }})
      description: "Kafka consumer has a 30 minutes and increasing lag\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PulsarSubscriptionHighNumberOfBacklogEntries
    expr: sum(pulsar_subscription_back_log) by (subscription) > 5000
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: Pulsar subscription high number of backlog entries (instance {{ $labels.instance }})
      description: "The number of subscription backlog entries is over 5k\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PulsarSubscriptionVeryHighNumberOfBacklogEntries
    expr: sum(pulsar_subscription_back_log) by (subscription) > 100000
    for: 1h
    labels:
      severity: critical
    annotations:
      summary: Pulsar subscription very high number of backlog entries (instance {{ $labels.instance }})
      description: "The number of subscription backlog entries is over 100k\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PulsarTopicLargeBacklogStorageSize
    expr: sum(pulsar_storage_size > 5*1024*1024*1024) by (topic)
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: Pulsar topic large backlog storage size (instance {{ $labels.instance }})
      description: "The topic backlog storage size is over 5 GB\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PulsarTopicVeryLargeBacklogStorageSize
    expr: sum(pulsar_storage_size > 20*1024*1024*1024) by (topic)
    for: 1h
    labels:
      severity: critical
    annotations:
      summary: Pulsar topic very large backlog storage size (instance {{ $labels.instance }})
      description: "The topic backlog storage size is over 




    
20 GB\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PulsarHighWriteLatency
    expr: sum(pulsar_storage_write_latency_overflow > 0) by (topic)
    for: 1h
    labels:
      severity: critical
    annotations:
      summary: Pulsar high write latency (instance {{ $labels.instance }})
      description: "Messages cannot be written in a timely fashion\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PulsarLargeMessagePayload
    expr: sum(pulsar_entry_size_overflow > 0) by (topic)
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: Pulsar large message payload (instance {{ $labels.instance }})
      description: "Observing large message payload (> 1MB)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PulsarHighLedgerDiskUsage
    expr: sum(bookie_ledger_dir__pulsar_data_bookkeeper_ledgers_usage) by (kubernetes_pod_name) > 75
    for: 1h
    labels:
      severity: critical
    annotations:
      summary: Pulsar high ledger disk usage (instance {{ $labels.instance }})
      description: "Observing Ledger Disk Usage (> 75%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PulsarReadOnlyBookies
    expr: count(bookie_SERVER_STATUS{} == 0) by (pod)
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Pulsar read only bookies (instance {{ $labels.instance }})
      description: "Observing Readonly Bookies\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PulsarHighNumberOfFunctionErrors
    expr: sum((rate(pulsar_function_user_exceptions_total{}[1m]) + rate(pulsar_function_system_exceptions_total{}[1m])) > 10) by (name)
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Pulsar high number of function errors (instance {{ $labels.instance }})
      description: "Observing more than 10 Function errors per minute\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: PulsarHighNumberOfSinkErrors
    expr: sum(rate(pulsar_sink_sink_exceptions_total{}[1m]) > 10) by (name)
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Pulsar high number of sink errors (instance {{ $labels.instance }})
      description: "Observing more than 10 Sink errors per minute\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: NatsHighConnectionCount
    labels:
      severity: warning
    annotations:
      summary: Nats high connection count (instance {{ $labels.instance }})
      description: "High number of NATS connections ({{ $value }}) for {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: NatsHighPendingBytes
    labels:
      severity: warning
    annotations:
      summary: Nats high pending bytes (instance {{ $labels.instance }})
      description: "High number of NATS pending bytes ({{ $value }}) for {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: NatsHighSubscriptionsCount
    labels:
      severity: warning
    annotations:
      summary: Nats high subscriptions count (instance {{ $labels.instance }})
      description: "High number of NATS subscriptions ({{ $value }}) for {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: NatsHighRoutesCount
    labels:
      severity: warning
    annotations:
      summary: Nats high routes count (instance {{ $labels.instance }})
      description: "High number of NATS routes ({{ $value }}) for {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
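  # A minimal sketch of an expression for NatsHighConnectionCount, assuming a hypothetical
  # prometheus-nats-exporter gauge named gnatsd_varz_connections and an arbitrary threshold:
  #   expr: gnatsd_varz_connections > 100
  #   for: 5m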
  - alert: SolrUpdateErrors
    expr: increase(solr_metrics_core_update_handler_errors_total[1m]) > 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Solr update errors (instance {{ $labels.instance }})
      description: "Solr collection {{ $labels.collection }} has failed updates for replica {{ $labels.replica }} on {{ $labels.base_url }}.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: SolrQueryErrors
    expr: increase(solr_metrics_core_errors_total{category="QUERY"}[1m]) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Solr query errors (instance {{ $labels.instance }})
      description: "Solr has increased query errors in collection {{ $labels.collection }} for replica {{ $labels.replica }} on {{ $labels.base_url }}.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: SolrReplicationErrors
    expr: increase(solr_metrics_core_errors_total{category="REPLICATION"}[1m]) > 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Solr replication errors (instance {{ $labels.instance }})
      description: "Solr collection {{ $labels.collection }} has failed updates for replica {{ $labels.replica }} on {{ $labels.base_url }}.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: SolrLowLiveNodeCount
    labels:
      severity: critical
    annotations:
      summary: Solr low live node count (instance {{ $labels.instance }})
      description: "Solr collection {{ $labels.collection }} has less than two live nodes for replica {{ $labels.replica }} on {{ $labels.base_url }}.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HadoopNameNodeDown
    labels:
      severity: critical
    annotations:
      summary: Hadoop Name Node Down (instance {{ $labels.instance }})
      description: "The Hadoop NameNode service is unavailable.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
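  # A minimal sketch of an expression for HadoopNameNodeDown, assuming the NameNode is
  # scraped under a job named "hadoop-namenode" (mirroring the ResourceManager rule below):
  #   expr: up{job="hadoop-namenode"} == 0
  #   for: 5m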
  - alert: HadoopResourceManagerDown
    expr: up{job="hadoop-resourcemanager"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Hadoop Resource Manager Down (instance {{ $labels.instance }})
      description: "The Hadoop ResourceManager service is unavailable.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HadoopDataNodeOutOfService
    expr: hadoop_datanode_last_heartbeat == 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Hadoop Data Node Out Of Service (instance {{ $labels.instance }})
      description: "The Hadoop DataNode is not sending heartbeats.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HadoopHdfsDiskSpaceLow
    expr: (hadoop_hdfs_bytes_total - hadoop_hdfs_bytes_used) / hadoop_hdfs_bytes_total < 0.1
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Hadoop HDFS Disk Space Low (instance {{ $labels.instance }})
      description: "Available HDFS disk space is running low.\n  VALUE = {{ $value }}\n  LABELS = 




    
{{ $labels }}"
  - alert: HadoopMapReduceTaskFailures
    expr: hadoop_mapreduce_task_failures_total > 100
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: Hadoop Map Reduce Task Failures (instance {{ $labels.instance }})
      description: "There is an unusually high number of MapReduce task failures.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HadoopResourceManagerMemoryHigh
    expr: hadoop_resourcemanager_memory_bytes / hadoop_resourcemanager_memory_max_bytes > 0.8
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Hadoop Resource Manager Memory High (instance {{ $labels.instance }})
      description: "The Hadoop ResourceManager is approaching its memory limit.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HadoopYarnContainerAllocationFailures
    expr: hadoop_yarn_container_allocation_failures_total > 10
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Hadoop YARN Container Allocation Failures (instance {{ $labels.instance }})
      description: "There is a significant number of YARN container allocation failures.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HadoopHbaseRegionCountHigh
    labels:
      severity: warning
    annotations:
      summary: Hadoop HBase Region Count High (instance {{ $labels.instance }})
      description: "The HBase cluster has an unusually high number of regions.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HadoopHbaseRegionServerHeapLow
    expr: hadoop_hbase_region_server_heap_bytes / hadoop_hbase_region_server_max_heap_bytes < 0.2
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: Hadoop HBase Region Server Heap Low (instance {{ $labels.instance }})
      description: "HBase Region Servers are running low on heap space.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HadoopHbaseWriteRequestsLatencyHigh
    expr: hadoop_hbase_write_requests_latency_seconds > 0.5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Hadoop HBase Write Requests Latency High (instance {{ $labels.instance }})
      description: "HBase Write Requests are experiencing high latency.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: NginxHighHttp4xxErrorRate
    expr: sum(rate(nginx_http_requests_total{status=~"^4.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Nginx high HTTP 4xx error rate (instance {{ $labels.instance }})
      description: "Too many HTTP requests with status 4xx (> 5%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: NginxHighHttp5xxErrorRate
    expr: sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})
      description: "Too many HTTP requests with status 5xx (> 5%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: NginxLatencyHigh
    expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[2m])) by (host, node, le)) > 3
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Nginx latency high (instance {{ $labels.instance }})
      description: "Nginx p99 latency is higher than 3 seconds\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ApacheDown
    labels:
      severity: critical
    annotations:
      summary: Apache down (instance {{ $labels.instance }})
      description: "Apache down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
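  # A minimal sketch of an expression for ApacheDown, assuming the apache_exporter
  # gauge apache_up:
  #   expr: apache_up == 0
  #   for: 0m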
  - alert: ApacheWorkersLoad
    expr: (sum by (instance) (apache_workers{state="busy"}) / sum by (instance) (apache_scoreboard) ) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Apache workers load (instance {{ $labels.instance }})
      description: "Apache workers in busy state approach the max workers count 80% workers busy on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ApacheRestart
    labels:
      severity: warning
    annotations:
      summary: Apache restart (instance {{ $labels.instance }})
      description: "Apache has just been restarted.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyHighHttp4xxErrorRateBackend
    expr: ((sum by (proxy) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (proxy) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }})
      description: "Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyHighHttp5xxErrorRateBackend
    expr: ((sum by (proxy) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (proxy) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }})
      description: "Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyHighHttp4xxErrorRateServer
    expr: ((sum by (server) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: HAProxy high HTTP 4xx error rate server (instance {{ $labels.instance }})
      description: "Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyHighHttp5xxErrorRateServer
    expr: ((sum by (server) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: HAProxy high HTTP 5xx error rate server (instance {{ $labels.instance }})
      description: "Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyServerResponseErrors
    expr: (sum by (server) (rate(haproxy_server_response_errors_total[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100 > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: HAProxy server response errors (instance {{ $labels.instance }})
      description: "Too many response errors to {{ $labels.server }} server (> 5%).\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyBackendConnectionErrors
    expr: (sum by (proxy) (rate(haproxy_backend_connection_errors_total[1m]))) > 100
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: HAProxy backend connection errors (instance {{ $labels.instance }})
      description: "Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request




    
 throughput may be too high.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyServerConnectionErrors
    expr: (sum by (proxy) (rate(haproxy_server_connection_errors_total[1m]))) > 100
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: HAProxy server connection errors (instance {{ $labels.instance }})
      description: "Too many connection errors to {{ $labels.server }} server (> 100 req/s). Request throughput may be too high.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyBackendMaxActiveSession>80%
    expr: ((haproxy_server_max_sessions >0) * 100) / (haproxy_server_limit_sessions > 0) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: HAProxy backend max active session > 80% (instance {{ $labels.instance }})
      description: "Session limit from backend {{ $labels.proxy }} to server {{ $labels.server }} reached 80% of limit - {{ $value | printf \"%.2f\"}}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyPendingRequests
    expr: sum by (proxy) (rate(haproxy_backend_current_queue[2m])) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: HAProxy pending requests (instance {{ $labels.instance }})
      description: "Some HAProxy requests are pending on {{ $labels.proxy }} - {{ $value | printf \"%.2f\"}}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyHttpSlowingDown
    expr: avg by (instance, proxy) (haproxy_backend_max_total_time_seconds) > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: HAProxy HTTP slowing down (instance {{ $labels.instance }})
      description: "Average request time is increasing - {{ $value | printf \"%.2f\"}}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyRetryHigh
    expr: sum by (proxy) (rate(haproxy_backend_retry_warnings_total[1m])) > 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: HAProxy retry high (instance {{ $labels.instance }})
      description: "High rate of retry on {{ $labels.proxy }} - {{ $value | printf \"%.2f\"}}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyHasNoAliveBackends
    expr: haproxy_backend_active_servers + haproxy_backend_backup_servers == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: HAproxy has no alive backends (instance {{ $labels.instance }})
      description: "HAProxy has no alive active or backup backends for {{ $labels.proxy }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyFrontendSecurityBlockedRequests
    expr: sum by (proxy) (rate(haproxy_frontend_denied_connections_total[2m])) > 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: HAProxy frontend security blocked requests (instance {{ $labels.instance }})
      description: "HAProxy is blocking requests for security reason\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyServerHealthcheckFailure
    expr: increase(haproxy_server_check_failures_total[1m]) > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: HAProxy server healthcheck failure (instance {{ $labels.instance }})
      description: "Some server healthcheck are failing on {{ $labels.server 




    
}}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyDown
    labels:
      severity: critical
    annotations:
      summary: HAProxy down (instance {{ $labels.instance }})
      description: "HAProxy down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
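  # A minimal sketch of an expression for HaproxyDown, assuming the haproxy_exporter
  # gauge haproxy_up:
  #   expr: haproxy_up == 0
  #   for: 0m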
  - alert: HaproxyHighHttp4xxErrorRateBackend
    expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }})
      description: "Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyHighHttp5xxErrorRateBackend
    expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }})
      description: "Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyHighHttp4xxErrorRateServer
    expr: sum by (server) (rate(haproxy_server_http_responses_total{code="4xx"}[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: HAProxy high HTTP 4xx error rate server (instance {{ $labels.instance }})
      description: "Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyHighHttp5xxErrorRateServer
    expr: sum by (server) (rate(haproxy_server_http_responses_total{code="5xx"}[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: HAProxy high HTTP 5xx error rate server (instance {{ $labels.instance }})
      description: "Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyServerResponseErrors
    expr: sum by (server) (rate(haproxy_server_response_errors_total[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: HAProxy server response errors (instance {{ $labels.instance }})
      description: "Too many response errors to {{ $labels.server }} server (> 5%).\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyBackendConnectionErrors
    expr: sum by (backend) (rate(haproxy_backend_connection_errors_total[1m])) > 100
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: HAProxy backend connection errors (instance {{ $labels.instance }})
      description: "Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyServerConnectionErrors
    expr: sum by (server) (rate(haproxy_server_connection_errors_total[1m])) > 100
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: HAProxy server connection errors (instance {{ $labels.instance }})
      description: "Too many connection errors to {{ $labels.server }} server (> 100 




    
req/s). Request throughput may be too high.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyBackendMaxActiveSession
    expr: ((sum by (backend) (avg_over_time(haproxy_backend_current_sessions[2m]) * 100) / sum by (backend) (avg_over_time(haproxy_backend_limit_sessions[2m])))) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: HAProxy backend max active session (instance {{ $labels.instance }})
      description: "HAproxy backend {{ $labels.fqdn }}/{{ $labels.backend }} is reaching session limit (> 80%).\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyPendingRequests
    expr: sum by (backend) (haproxy_backend_current_queue) > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: HAProxy pending requests (instance {{ $labels.instance }})
      description: "Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyHttpSlowingDown
    expr: avg by (backend) (haproxy_backend_http_total_time_average_seconds) > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: HAProxy HTTP slowing down (instance {{ $labels.instance }})
      description: "Average request time is increasing\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyRetryHigh
    expr: sum by (backend) (rate(haproxy_backend_retry_warnings_total[1m])) > 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: HAProxy retry high (instance {{ $labels.instance }})
      description: "High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyBackendDown
    labels:
      severity: critical
    annotations:
      summary: HAProxy backend down (instance {{ $labels.instance }})
      description: "HAProxy backend is down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyServerDown
    labels:
      severity: critical
    annotations:
      summary: HAProxy server down (instance {{ $labels.instance }})
      description: "HAProxy server is down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
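  # Minimal sketches of expressions for the two rules above, assuming the haproxy_exporter
  # gauges haproxy_backend_up and haproxy_server_up:
  #   HaproxyBackendDown: expr: haproxy_backend_up == 0
  #   HaproxyServerDown:  expr: haproxy_server_up == 0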
  - alert: HaproxyFrontendSecurityBlockedRequests
    expr: sum by (frontend) (rate(haproxy_frontend_requests_denied_total[2m])) > 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: HAProxy frontend security blocked requests (instance {{ $labels.instance }})
      description: "HAProxy is blocking requests for security reason\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: HaproxyServerHealthcheckFailure
    expr: increase(haproxy_server_check_failures_total[1m]) > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: HAProxy server healthcheck failure (instance {{ $labels.instance }})
      description: "Some server healthcheck are failing on {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: TraefikServiceDown
    expr: count(traefik_service_server_up) by (service) == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Traefik service down (instance {{ $labels.instance }})
      description: "All Traefik services are down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: TraefikHighHttp4xxErrorRateService
    expr: sum(rate(traefik_service_requests_total{code=~"4.*"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Traefik high HTTP 4xx error rate service (instance {{ $labels.instance }})
      description: "Traefik service 4xx error rate is above 5%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: TraefikHighHttp5xxErrorRateService
    expr: sum(rate(traefik_service_requests_total{code=~"5.*"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Traefik high HTTP 5xx error rate service (instance {{ $labels.instance }})
      description: "Traefik service 5xx error rate is above 5%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: TraefikBackendDown
    expr: count(traefik_backend_server_up) by (backend) == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Traefik backend down (instance {{ $labels.instance }})
      description: "All Traefik backends are down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: TraefikHighHttp4xxErrorRateBackend
    expr: sum(rate(traefik_backend_requests_total{code=~"4.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Traefik high HTTP 4xx error rate backend (instance {{ $labels.instance }})
      description: "Traefik backend 4xx error rate is above 5%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: TraefikHighHttp5xxErrorRateBackend
    expr: sum(rate(traefik_backend_requests_total{code=~"5.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Traefik high HTTP 5xx error rate backend (instance {{ $labels.instance }})
      description: "Traefik backend 5xx error rate is above 5%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: Php-fpmMax-childrenReached
    expr: sum(phpfpm_max_children_reached_total) by (instance) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: PHP-FPM max-children reached (instance {{ $labels.instance }})
      description: "PHP-FPM reached max children - {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: JvmMemoryFillingUp
    expr: (sum by (instance)(jvm_memory_used_bytes{area="heap"}) / sum by (instance)(jvm_memory_max_bytes{area="heap"})) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: JVM memory filling up (instance {{ $labels.instance }})
      description: "JVM memory is filling up (> 80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: SidekiqQueueSize
    labels:
      severity: warning
    annotations:
      summary: Sidekiq queue size (instance {{ $labels.instance }})
      description: "Sidekiq queue {{ $labels.name }} is growing\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: SidekiqSchedulingLatencyTooHigh
    expr: max(sidekiq_queue_latency) > 60
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Sidekiq scheduling latency too high (instance {{ $labels.instance }})
      description: "Sidekiq jobs are taking more than 1min to be picked up. Users may be seeing delays in background processing.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesNodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes Node not ready (instance {{ $labels.instance }})
      description: "Node {{ $labels.node }} has been unready for a long time\n  VALUE = {{ $value 




    
}}\n  LABELS = {{ $labels }}"
  - alert: KubernetesNodeMemoryPressure
    expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes Node memory pressure (instance {{ $labels.instance }})
      description: "Node {{ $labels.node }} has MemoryPressure condition\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesNodeDiskPressure
    expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes Node disk pressure (instance {{ $labels.instance }})
      description: "Node {{ $labels.node }} has DiskPressure condition\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesNodeNetworkUnavailable
    expr: kube_node_status_condition{condition="NetworkUnavailable",status="true"} == 1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes Node network unavailable (instance {{ $labels.instance }})
      description: "Node {{ $labels.node }} has NetworkUnavailable condition\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesNodeOutOfPodCapacity
    expr: sum by (node) ((kube_pod_status_phase{phase="Running"} == 1) + on(uid) group_left(node) (0 * kube_pod_info{pod_template_hash=""})) / sum by (node) (kube_node_status_allocatable{resource="pods"}) * 100 > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes Node out of pod capacity (instance {{ $labels.instance }})
      description: "Node {{ $labels.node }} is out of pod capacity\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesContainerOomKiller
    expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes Container oom killer (instance {{ $labels.instance }})
      description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesJobFailed
    labels:
      severity: warning
    annotations:
      summary: Kubernetes Job failed (instance {{ $labels.instance }})
      description: "Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesCronjobSuspended
    labels:
      severity: warning
    annotations:
      summary: Kubernetes CronJob suspended (instance {{ $labels.instance }})
      description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
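  # Minimal sketches of expressions for the two rules above, assuming the standard
  # kube-state-metrics series kube_job_status_failed and kube_cronjob_spec_suspend:
  #   KubernetesJobFailed:        expr: kube_job_status_failed > 0
  #   KubernetesCronjobSuspended: expr: kube_cronjob_spec_suspend != 0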
  - alert: KubernetesPersistentvolumeclaimPending
    expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes PersistentVolumeClaim pending (instance {{ $labels.instance }})
      description: "PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesVolumeOutOfDiskSpace
    expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes Volume out of disk space (instance {{ $labels.instance }})
      description: "Volume is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = 




    
{{ $labels }}"
  - alert: KubernetesVolumeFullInFourDays
    expr: predict_linear(kubelet_volume_stats_available_bytes[6h:5m], 4 * 24 * 3600) < 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes Volume full in four days (instance {{ $labels.instance }})
      description: "Volume under {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesPersistentvolumeError
    expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending", job="kube-state-metrics"} > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes PersistentVolume error (instance {{ $labels.instance }})
      description: "Persistent volume {{ $labels.persistentvolume }} is in bad state\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesStatefulsetDown
    expr: kube_statefulset_replicas != kube_statefulset_status_replicas_ready > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes StatefulSet down (instance {{ $labels.instance }})
      description: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} went down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesHpaScaleInability
    expr: (kube_horizontalpodautoscaler_spec_max_replicas - kube_horizontalpodautoscaler_status_desired_replicas) * on (horizontalpodautoscaler,namespace) (kube_horizontalpodautoscaler_status_condition{condition="ScalingLimited", status="true"} == 1) == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes HPA scale inability (instance {{ $labels.instance }})
      description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is unable to scale\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesHpaMetricsUnavailability
    expr: kube_horizontalpodautoscaler_status_condition{status="false", condition="ScalingActive"} == 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes HPA metrics unavailability (instance {{ $labels.instance }})
      description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is unable to collect metrics\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesHpaScaleMaximum
    expr: (kube_horizontalpodautoscaler_status_desired_replicas >= kube_horizontalpodautoscaler_spec_max_replicas) and (kube_horizontalpodautoscaler_spec_max_replicas > 1) and (kube_horizontalpodautoscaler_spec_min_replicas != kube_horizontalpodautoscaler_spec_max_replicas)
    for: 2m
    labels:
      severity: info
    annotations:
      summary: Kubernetes HPA scale maximum (instance {{ $labels.instance }})
      description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has hit maximum number of desired pods\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesHpaUnderutilized
    expr: max(quantile_over_time(0.5, kube_horizontalpodautoscaler_status_desired_replicas[1d]) == kube_horizontalpodautoscaler_spec_min_replicas) by (horizontalpodautoscaler) > 3
    for: 0m
    labels:
      severity: info
    annotations:
      summary: Kubernetes HPA underutilized (instance {{ $labels.instance }})
      description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is constantly at minimum replicas for 50% of the time. Potential cost saving here.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesPodNotHealthy
    expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
      description: "Pod {{ $labels.namespace 




    
}}/{{ $labels.pod }} has been in a non-running state for longer than 15 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesPodCrashLooping
    expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
      description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesReplicasetReplicasMismatch
    expr: kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes ReplicaSet replicas mismatch (instance {{ $labels.instance }})
      description: "ReplicaSet {{ $labels.namespace }}/{{ $labels.replicaset }} replicas mismatch\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesDeploymentReplicasMismatch
    expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes Deployment replicas mismatch (instance {{ $labels.instance }})
      description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} replicas mismatch\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesStatefulsetReplicasMismatch
    expr: kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes StatefulSet replicas mismatch (instance {{ $labels.instance }})
      description: "StatefulSet does not match the expected number of replicas.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesDeploymentGenerationMismatch
    expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes Deployment generation mismatch (instance {{ $labels.instance }})
      description: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has failed but has not been rolled back.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesStatefulsetGenerationMismatch
    expr: kube_statefulset_status_observed_generation != kube_statefulset_metadata_generation
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes StatefulSet generation mismatch (instance {{ $labels.instance }})
      description: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has failed but has not been rolled back.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesStatefulsetUpdateNotRolledOut
    expr: max without (revision) (kube_statefulset_status_current_revision unless kube_statefulset_status_update_revision) * (kube_statefulset_replicas != kube_statefulset_status_replicas_updated)
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes StatefulSet update not rolled out (instance {{ $labels.instance }})
      description: "StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesDaemonsetRolloutStuck
    expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100 or kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes DaemonSet rollout stuck (instance {{ $labels.instance }})
      description: "Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled or not 




    
ready\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesDaemonsetMisscheduled
    expr: kube_daemonset_status_number_misscheduled > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes DaemonSet misscheduled (instance {{ $labels.instance }})
      description: "Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  # Threshold should be customized for each cronjob name.
  - alert: KubernetesCronjobTooLong
    expr: time() - kube_cronjob_next_schedule_time > 3600
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes CronJob too long (instance {{ $labels.instance }})
      description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesJobSlowCompletion
    expr: kube_job_spec_completions - kube_job_status_succeeded - kube_job_status_failed > 0
    for: 12h
    labels:
      severity: critical
    annotations:
      summary: Kubernetes Job slow completion (instance {{ $labels.instance }})
      description: "Kubernetes Job {{ $labels.namespace }}/{{ $labels.job_name }} did not complete in time.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesApiServerErrors
    expr: sum(rate(apiserver_request_total{job="apiserver",code=~"(?:5..)"}[1m])) by (instance, job) / sum(rate(apiserver_request_total{job="apiserver"}[1m])) by (instance, job) * 100 > 3
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes API server errors (instance {{ $labels.instance }})
      description: "Kubernetes API server is experiencing high error rate\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesApiClientErrors
    expr: (sum(rate(rest_client_requests_total{code=~"(4|5).."}[1m])) by (instance, job) / sum(rate(rest_client_requests_total[1m])) by (instance, job)) * 100 > 1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes API client errors (instance {{ $labels.instance }})
      description: "Kubernetes API client is experiencing high error rate\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesClientCertificateExpiresNextWeek
    expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 7*24*60*60
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes client certificate expires next week (instance {{ $labels.instance }})
      description: "A client certificate used to authenticate to the apiserver is expiring next week.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesClientCertificateExpiresSoon
    expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 24*60*60
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Kubernetes client certificate expires soon (instance {{ $labels.instance }})
      description: "A client certificate used to authenticate to the apiserver is expiring in less than 24.0 hours.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: KubernetesApiServerLatency
    expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!~"(?:CONNECT|WATCHLIST|WATCH|PROXY)"} [10m])) WITHOUT (subresource)) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes API server latency (instance {{ $labels.instance }})
      description: "Kubernetes API server has a 99th percentile 




    
latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: NomadJobFailed
    labels:
      severity: warning
    annotations:
      summary: Nomad job failed (instance {{ $labels.instance }})
      description: "Nomad job failed\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: NomadJobLost
    labels:
      severity: warning
    annotations:
      summary: Nomad job lost (instance {{ $labels.instance }})
      description: "Nomad job lost\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: NomadJobQueued
    labels:
      severity: warning
    annotations:
      summary: Nomad job queued (instance {{ $labels.instance }})
      description: "Nomad job queued\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
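  # Minimal sketches of expressions for the three rules above, assuming Nomad's job-summary
  # telemetry gauges (named consistently with nomad_nomad_blocked_evals_total_blocked below):
  #   NomadJobFailed: expr: nomad_nomad_job_summary_failed > 0
  #   NomadJobLost:   expr: nomad_nomad_job_summary_lost > 0
  #   NomadJobQueued: expr: nomad_nomad_job_summary_queued > 0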
  - alert: NomadBlockedEvaluation
    expr: nomad_nomad_blocked_evals_total_blocked > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Nomad blocked evaluation (instance {{ $labels.instance }})
      description: "Nomad blocked evaluation\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ConsulServiceHealthcheckFailed
    expr: consul_catalog_service_node_healthy == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Consul service healthcheck failed (instance {{ $labels.instance }})
      description: "Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}`\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ConsulMissingMasterNode
    labels:
      severity: critical
    annotations:
      summary: Consul missing master node (instance {{ $labels.instance }})
      description: "The number of Consul raft peers should be 3 in order to preserve quorum.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
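  # A minimal sketch of an expression for ConsulMissingMasterNode, assuming the
  # consul_exporter gauge consul_raft_peers and the 3-peer quorum mentioned above:
  #   expr: consul_raft_peers < 3
  #   for: 0m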
  - alert: ConsulAgentUnhealthy
    expr: consul_health_node_status{status="critical"} == 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Consul agent unhealthy (instance {{ $labels.instance }})
      description: "A Consul agent is down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: EtcdInsufficientMembers
    labels:
      severity: critical
    annotations:
      summary: Etcd insufficient Members (instance {{ $labels.instance }})
      description: "Etcd cluster should have an odd number of members\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: EtcdNoLeader
    labels:
      severity: critical
    annotations:
      summary: Etcd no Leader (instance {{ $labels.instance }})
      description: "Etcd cluster has no leader\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
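  # Minimal sketches of expressions for the two rules above, assuming the standard etcd
  # server metrics etcd_server_id and etcd_server_has_leader:
  #   EtcdInsufficientMembers: expr: count(etcd_server_id) % 2 == 0
  #   EtcdNoLeader:            expr: etcd_server_has_leader == 0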
  - alert: EtcdHighNumberOfLeaderChanges
    expr: increase(etcd_server_leader_changes_seen_total[10m]) > 2
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Etcd high number of leader changes (instance {{ $labels.instance }})
      description: "Etcd leader changed more than 2 times during 10 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: EtcdHighNumberOfFailedGrpcRequests
    expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.01
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }})
      description: "More than 1% GRPC request failure detected in Etcd\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: EtcdHighNumberOfFailedGrpcRequests
    expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.05
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }})
      description: "More than 5% GRPC request failure detected in Etcd\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: EtcdGrpcRequestsSlow
    expr: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[1m])) by (grpc_service, grpc_method, le)) > 0.15
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Etcd GRPC requests slow (instance {{ $labels.instance }})
      description: "GRPC requests slowing down, 99th percentile is over 0.15s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: EtcdHighNumberOfFailedHttpRequests
    expr: sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.01
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }})
      description: "More than 1% HTTP failure detected in Etcd\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: EtcdHighNumberOfFailedHttpRequests
    expr: sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.05
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }})
      description: "More than 5% HTTP failure detected in Etcd\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: EtcdHttpRequestsSlow
    expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[1m])) > 0.15
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Etcd HTTP requests slow (instance {{ $labels.instance }})
      description: "HTTP requests slowing down, 99th percentile is over 0.15s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: EtcdMemberCommunicationSlow
    expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[1m])) > 0.15
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Etcd member communication slow (instance {{ $labels.instance }})
      description: "Etcd member communication slowing down, 99th percentile is over 0.15s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: EtcdHighNumberOfFailedProposals
    expr: increase(etcd_server_proposals_failed_total[1h]) > 5
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Etcd high number of failed proposals (instance {{ $labels.instance }})
      description: "Etcd server got more than 5 failed proposals past hour\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: EtcdHighFsyncDurations
    expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[1m])) > 0.5
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Etcd high fsync durations (instance {{ $labels.instance }})
      description: "Etcd WAL fsync duration increasing, 99th percentile is over 0.5s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: EtcdHighCommitDurations
    expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[1m])) > 0.25
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Etcd high commit durations (instance {{ $labels.instance }})
      description: "Etcd commit duration increasing, 99th percentile is




    
 over 0.25s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: LinkerdHighErrorRate
    expr: sum(rate(request_errors_total[1m])) by (deployment, statefulset, daemonset) / sum(rate(request_total[1m])) by (deployment, statefulset, daemonset) * 100 > 10
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Linkerd high error rate (instance {{ $labels.instance }})
      description: "Linkerd error rate for {{ $labels.deployment | $labels.statefulset | $labels.daemonset }} is over 10%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: IstioKubernetesGatewayAvailabilityDrop
    expr: min(kube_deployment_status_replicas_available{deployment="istio-ingressgateway", namespace="istio-system"}) without (instance, pod) < 2
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Istio Kubernetes gateway availability drop (instance {{ $labels.instance }})
      description: "Gateway pods have dropped. Inbound traffic will likely be affected.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: IstioPilotHighTotalRequestRate
    expr: sum(rate(pilot_xds_push_errors[1m])) / sum(rate(pilot_xds_pushes[1m])) * 100 > 5
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Istio Pilot high total request rate (instance {{ $labels.instance }})
      description: "Number of Istio Pilot push errors is too high (> 5%). Envoy sidecars might have outdated configuration.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: IstioMixerPrometheusDispatchesLow
    expr: sum(rate(mixer_runtime_dispatches_total{adapter=~"prometheus"}[1m])) < 180
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Istio Mixer Prometheus dispatches low (instance {{ $labels.instance }})
      description: "Number of Mixer dispatches to Prometheus is too low. Istio metrics might not be being exported properly.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: IstioHighTotalRequestRate
    expr: sum(rate(istio_requests_total{reporter="destination"}[5m])) > 1000
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Istio high total request rate (instance {{ $labels.instance }})
      description: "Global request rate in the service mesh is unusually high.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: IstioLowTotalRequestRate
    expr: sum(rate(istio_requests_total{reporter="destination"}[5m])) < 100
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Istio low total request rate (instance {{ $labels.instance }})
      description: "Global request rate in the service mesh is unusually low.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: IstioHigh4xxErrorRate
    expr: sum(rate(istio_requests_total{reporter="destination", response_code=~"4.*"}[5m])) / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100 > 5
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Istio high 4xx error rate (instance {{ $labels.instance }})
      description: "High percentage of HTTP 5xx responses in Istio (> 5%).\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: IstioHigh5xxErrorRate
    expr: sum(rate(istio_requests_total{reporter="destination", response_code=~"5.*"}[5m])) / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100 > 5
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Istio high 5xx error rate (instance {{ $labels.instance }})
      description: "High percentage of HTTP 5xx responses in Istio (> 5%).\n  VALUE = {{ $value }}\n  LABELS 




    
= {{ $labels }}"
  - alert: IstioHighRequestLatency
    expr: rate(istio_request_duration_milliseconds_sum{reporter="destination"}[1m]) / rate(istio_request_duration_milliseconds_count{reporter="destination"}[1m]) > 100
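    # average request duration in milliseconds: rate of duration_sum divided by rate of duration_count over the last minute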
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Istio high request latency (instance {{ $labels.instance }})
      description: "Istio average requests execution is longer than 100ms.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: IstioLatency99Percentile
    expr: histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[1m])) by (destination_canonical_service, destination_workload_namespace, source_canonical_service, source_workload_namespace, le)) > 1000
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Istio latency 99 percentile (instance {{ $labels.instance }})
      description: "Istio 1% slowest requests are longer than 1000ms.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: IstioPilotDuplicateEntry
    expr: sum(rate(pilot_duplicate_envoy_clusters{}[5m])) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Istio Pilot Duplicate Entry (instance {{ $labels.instance }})
      description: "Istio pilot duplicate entry error.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ArgocdServiceNotSynced
    expr: argocd_app_info{sync_status!="Synced"} != 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: ArgoCD service not synced (instance {{ $labels.instance }})
      description: "Service {{ $labels.name }} run by argo is currently not in sync.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ArgocdServiceUnhealthy
    expr: argocd_app_info{health_status!="Healthy"} != 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: ArgoCD service unhealthy (instance {{ $labels.instance }})
      description: "Service {{ $labels.name }} run by argo is currently not healthy.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CephMonitorClockSkew
    expr: abs(ceph_monitor_clock_skew_seconds) > 0.2
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Ceph monitor clock skew (instance {{ $labels.instance }})
      description: "Ceph monitor clock skew detected. Please check ntp and hardware clock settings\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Ceph monitor low space (warning): Ceph monitor storage is low.
Ceph OSD Down (critical): Ceph Object Storage Daemon is down.
Ceph high OSD latency (warning): Ceph Object Storage Daemon latency is high. Please check whether it is stuck in a weird state.
Ceph OSD low space (warning): Ceph Object Storage Daemon is running out of space. Please add more disks.
Ceph OSD reweighted (warning): Ceph Object Storage Daemon is taking too long to resize.
Ceph PG down (critical): Some Ceph placement groups are down. Please ensure that all the data is available.
Ceph PG incomplete (critical): Some Ceph placement groups are incomplete. Please ensure that all the data is available.
Ceph PG inconsistent (warning): Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.
Ceph PG activation long (warning): Some Ceph placement groups are taking too long to activate.
Ceph PG backfill full (warning): Some Ceph placement groups are located on a full Object Storage Daemon. Those PGs can become unavailable shortly. Please check the OSDs, change weights, or reconfigure CRUSH rules.
Ceph PG unavailable (critical): Some Ceph placement groups are unavailable.
  - alert: SpeedtestSlowInternetDownload
    expr: avg_over_time(speedtest_download[10m]) < 100
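    # speedtest_download is reported in Mbps here (see the description below); tune the 100 Mbps threshold to your subscribed plan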
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: SpeedTest Slow Internet Download (instance {{ $labels.instance }})
      description: "Internet download speed is currently {{humanize $value}} Mbps.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: SpeedtestSlowInternetUpload
    expr: avg_over_time(speedtest_upload[10m]) < 20
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: SpeedTest Slow Internet Upload (instance {{ $labels.instance }})
      description: "Internet upload speed is currently {{humanize $value}} Mbps.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
ZFS offline pool (critical): A ZFS zpool is in an unexpected state: {{ $labels.state }}.
  - alert: ZfsPoolOutOfSpace
    expr: zfs_pool_free_bytes * 100 / zfs_pool_size_bytes < 10 and ON (instance, device, mountpoint) zfs_pool_readonly == 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: ZFS pool out of space (instance {{ $labels.instance }})
      description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
ZFS pool unhealthy (critical): ZFS pool state is {{ $value }}. See comments for more information.
ZFS collector failed (warning): ZFS collector for {{ $labels.instance }} has failed to collect information.
  - alert: OpenebsUsedPoolCapacity
    expr: openebs_used_pool_capacity_percent > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: OpenEBS used pool capacity (instance {{ $labels.instance }})
      description: "OpenEBS Pool use more than 80% of his capacity\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Minio cluster disk offline (critical): Minio cluster disk is offline.
Minio node disk offline (critical): Minio cluster node disk is offline.
  - alert: MinioDiskSpaceUsage
    expr: disk_storage_available / disk_storage_total * 100 < 10
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Minio disk space usage (instance {{ $labels.instance }})
      description: "Minio available free space is low (< 10%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
SSL certificate probe failed (critical): Failed to fetch SSL information for {{ $labels.instance }}.
  - alert: SslCertificateOscpStatusUnknown
    expr: ssl_ocsp_response_status == 2
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: SSL certificate OCSP status unknown (instance {{ $labels.instance }})
      description: "Failed to get the OCSP status for {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
SSL certificate revoked (critical): SSL certificate revoked for {{ $labels.instance }}.
  - alert: SslCertificateExpiryIn7Days
    expr: ssl_verified_cert_not_after{chain_no="0"} - time() < 86400 * 7
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: SSL certificate expiry (< 7 days) (instance {{ $labels.instance }})
      description: "{{ $labels.instance }} Certificate is expiring in 7 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Juniper switch down (critical): The switch appears to be down.
  - alert: JuniperHighBandwidthUsage1gib
    expr: rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: Juniper high Bandwidth Usage 1GiB (instance {{ $labels.instance }})
      description: "Interface is highly saturated. (> 0.90GiB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: JuniperHighBandwidthUsage1gib
    expr: rate(junos_interface_transmit_bytes[1m]) * 8 > 1e+9 * 0.80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Juniper high Bandwidth Usage 1GiB (instance {{ $labels.instance }})
      description: "Interface is getting saturated. (> 0.80GiB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
CoreDNS Panic Count (critical): Number of CoreDNS panics encountered.
Freeswitch down (critical): Freeswitch is unresponsive.
  - alert: FreeswitchSessionsWarning
    expr: (freeswitch_session_active * 100 / freeswitch_session_limit) > 80
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Freeswitch Sessions Warning (instance {{ $labels.instance }})
      description: "High sessions usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: FreeswitchSessionsCritical
    expr: (freeswitch_session_active * 100 / freeswitch_session_limit) > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Freeswitch Sessions Critical (instance {{ $labels.instance }})
      description: "High sessions usage on {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Vault sealed (critical): Vault instance is sealed on {{ $labels.instance }}.
  - alert: VaultTooManyPendingTokens
    expr: avg(vault_token_create_count - vault_token_store_count) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Vault too many pending tokens (instance {{ $labels.instance }})
      description: "Too many pending tokens {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: VaultTooManyInfinityTokens
    expr: vault_token_count_by_ttl{creation_ttl="+Inf"} > 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Vault too many infinity tokens (instance {{ $labels.instance }})
      description: "Too many infinity tokens {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: VaultClusterHealth
    expr: sum(vault_core_active) / count(vault_core_active) <= 0.5
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Vault cluster health (instance {{ $labels.instance }})
      description: "Vault cluster is not healthy {{ $labels.instance }}: {{ $value | printf \"%.2f\"}}%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CloudflareHttp4xxErrorRate
    expr: (sum by(zone) (rate(cloudflare_zone_requests_status{status=~"^4.."}[15m])) / on (zone) sum by (zone) (rate(cloudflare_zone_requests_status[15m]))) * 100 > 5
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Cloudflare http 4xx error rate (instance {{ $labels.instance }})
      description: "Cloudflare high HTTP 4xx error rate (> 5% for domain {{ $labels.zone }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CloudflareHttp5xxErrorRate
    expr: (sum by (zone) (rate(cloudflare_zone_requests_status{status=~"^5.."}[5m])) / on (zone) sum by (zone) (rate(cloudflare_zone_requests_status[5m]))) * 100 > 5
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cloudflare http 5xx error rate (instance {{ $labels.instance }})
      description: "Cloudflare high HTTP 5xx error rate (> 5% for domain {{ $labels.zone }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
No more than one Thanos Compact instance should be running at once. There are {{$value}} instances running. [copy]
  - alert: ThanosCompactorMultipleRunning
    expr: sum by (job) (up{job=~".*thanos-compact.*"}) > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Thanos Compactor Multiple Running (instance {{ $labels.instance }})
      description: "No more than one Thanos Compact instance should be running at once. There are {{$value}} instances running.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosCompactorHalted
    expr: thanos_compact_halted{job=~".*thanos-compact.*"} == 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Thanos Compactor Halted (instance {{ $labels.instance }})
      description: "Thanos Compact {{$labels.job}} has failed to run and now is halted.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}% of compactions. [copy]
  - alert: ThanosCompactorHighCompactionFailures
    expr: (sum by (job) (rate(thanos_compact_group_compactions_failures_total{job=~".*thanos-compact.*"}[5m])) / sum by (job) (rate(thanos_compact_group_compactions_total{job=~".*thanos-compact.*"}[5m])) * 100 > 5)
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Thanos Compactor High Compaction Failures (instance {{ $labels.instance }})
      description: "Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}% of compactions.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Compact {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations. [copy]
  - alert: ThanosCompactBucketHighOperationFailures
    expr: (sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-compact.*"}[5m])) / sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-compact.*"}[5m])) * 100 > 5)
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Thanos Compact Bucket High Operation Failures (instance {{ $labels.instance }})
      description: "Thanos Compact {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosCompactHasNotRun
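    # time since the last successful upload, converted from seconds to hours (/ 60 / 60); fires after 24 hours without an upload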
    expr: (time() - max by (job) (max_over_time(thanos_objstore_bucket_last_successful_upload_time{job=~".*thanos-compact.*"}[24h]))) / 60 / 60 > 24
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Thanos Compact Has Not Run (instance {{ $labels.instance }})
      description: "Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of "query" requests. [copy]
  - alert: ThanosQueryHttpRequestQueryErrorRateHigh
    expr: (sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query"}[5m]))/  sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query"}[5m]))) * 100 > 5
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Thanos Query Http Request Query Error Rate High (instance {{ $labels.instance }})
      description: "Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of \"query\" requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of "query_range" requests. [copy]
  - alert: ThanosQueryHttpRequestQueryRangeErrorRateHigh
    expr: (sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query_range"}[5m]))/  sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query_range"}[5m]))) * 100 > 5
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Thanos Query Http Request Query Range Error Rate High (instance {{ $labels.instance }})
      description: "Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of \"query_range\" requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosQueryGrpcServerErrorRate
    expr: (sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-query.*"}[5m]))/  sum by (job) (rate(grpc_server_started_total{job=~".*thanos-query.*"}[5m])) * 100 > 5)
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Thanos Query Grpc Server Error Rate (instance {{ $labels.instance }})
      description: "Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosQueryGrpcClientErrorRate
    expr: (sum by (job) (rate(grpc_client_handled_total{grpc_code!="OK", job=~".*thanos-query.*"}[5m])) / sum by (job) (rate(grpc_client_started_total{job=~".*thanos-query.*"}[5m]))) * 100 > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Thanos Query Grpc Client Error Rate (instance {{ $labels.instance }})
      description: "Thanos Query {{$labels.job}} is failing to send {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Query {{$labels.job}} have {{$value | humanize}}% of failing DNS queries for store endpoints. [copy]
  - alert: ThanosQueryHighDNSFailures
    expr: (sum by (job) (rate(thanos_query_store_apis_dns_failures_total{job=~".*thanos-query.*"}[5m])) / sum by (job) (rate(thanos_query_store_apis_dns_lookups_total{job=~".*thanos-query.*"}[5m]))) * 100 > 1
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Thanos Query High DNS Failures (instance {{ $labels.instance }})
      description: "Thanos Query {{$labels.job}} have {{$value | humanize}}% of failing DNS queries for store endpoints.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for instant queries. [copy]
  - alert: ThanosQueryInstantLatencyHigh
    expr: (histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query"}[5m]))) > 40 and sum by (job) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query"}[5m])) > 0)
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: Thanos Query Instant Latency High (instance {{ $labels.instance }})
      description: "Thanos Query {{$labels.job}} has a 99th percentile latency of 




    
{{$value}} seconds for instant queries.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for range queries. [copy]
  - alert: ThanosQueryRangeLatencyHigh
    expr: (histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query_range"}[5m]))) > 90 and sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-query.*", handler="query_range"}[5m])) > 0)
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: Thanos Query Range Latency High (instance {{ $labels.instance }})
      description: "Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for range queries.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Query {{$labels.job}} has been overloaded for more than 15 minutes. This may be a symptom of excessive simultaneous complex requests, low performance of the Prometheus API, or failures within these components. Assess the health of the Thanos Query instances and the connected Prometheus instances, look for potential senders of these requests, and then contact support. [copy]
  - alert: ThanosQueryOverload
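    # concurrency headroom: configured query gate size minus average in-flight queries; below 1 the gate is effectively saturated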
    expr: (max_over_time(thanos_query_concurrent_gate_queries_max[5m]) - avg_over_time(thanos_query_concurrent_gate_queries_in_flight[5m]) < 1)
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Thanos Query Overload (instance {{ $labels.instance }})
      description: "Thanos Query {{$labels.job}} has been overloaded for more than 15 minutes. This may be a symptom of excessive simultanous complex requests, low performance of the Prometheus API, or failures within these components. Assess the health of the Thanos query instances, the connnected Prometheus instances, look for potential senders of these requests and then contact support.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosReceiveHttpRequestErrorRateHigh
    expr: (sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-receive.*", handler="receive"}[5m]))/  sum by (job) (rate(http_requests_total{job=~".*thanos-receive.*", handler="receive"}[5m]))) * 100 > 5
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Thanos Receive Http Request Error Rate High (instance {{ $labels.instance }})
      description: "Thanos Receive {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Receive {{$labels.job}} has a 99th percentile latency of {{ $value }} seconds for requests. [copy]
  - alert: ThanosReceiveHttpRequestLatencyHigh
    expr: (histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-receive.*", handler="receive"}[5m]))) > 10 and sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-receive.*", handler="receive"}[5m])) > 0)
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: Thanos Receive Http Request Latency High (instance {{ $labels.instance }})
      description: "Thanos Receive {{$labels.job}} has a 99th percentile latency of {{ $value }} seconds for requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Receive {{$labels.job}} is failing to replicate {{$value | humanize}}% of requests. [copy]
  - alert: ThanosReceiveHighReplicationFailures
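    # floor((replication_factor + 1) / 2) is the write quorum; the alert fires when the replication error ratio exceeds quorum / hashring nodes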
    expr: thanos_receive_replication_factor > 1 and ((sum by (job) (rate(thanos_receive_replications_total{result="error", job=~".*thanos-receive.*"}[5m])) / sum by (job) (rate(thanos_receive_replications_total{job=~".*thanos-receive.*"}[5m]))) > (max by (job) (floor((thanos_receive_replication_factor{job=~".*thanos-receive.*"}+1)/ 2)) / max by (job) (thanos_receive_hashring_nodes{job=~".*thanos-receive.*"}))) * 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Thanos Receive High Replication Failures (instance {{ $labels.instance }})
      description: "Thanos Receive {{$labels.job}} is failing to replicate {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosReceiveHighForwardRequestFailures
    expr: (sum by (job) (rate(thanos_receive_forward_requests_total{result="error", job=~".*thanos-receive.*"}[5m]))/  sum by (job) (rate(thanos_receive_forward_requests_total{job=~".*thanos-receive.*"}[5m]))) * 100 > 20
    for: 5m
    labels:
      severity: info
    annotations:
      summary: Thanos Receive High Forward Request Failures (instance {{ $labels.instance }})
      description: "Thanos Receive {{$labels.job}} is failing to forward {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Receive {{$labels.job}} is failing to refresh hashring file, {{$value | humanize}} of attempts failed. [copy]
  - alert: ThanosReceiveHighHashringFileRefreshFailures
    expr: (sum by (job) (rate(thanos_receive_hashrings_file_errors_total{job=~".*thanos-receive.*"}[5m])) / sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total{job=~".*thanos-receive.*"}[5m])) > 0)
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Thanos Receive High Hashring File Refresh Failures (instance {{ $labels.instance }})
      description: "Thanos Receive {{$labels.job}} is failing to refresh hashring file, {{$value | humanize}} of attempts failed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosReceiveConfigReloadFailure
    expr: avg by (job) (thanos_receive_config_last_reload_successful{job=~".*thanos-receive.*"}) != 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Thanos Receive Config Reload Failure (instance {{ $labels.instance }})
      description: "Thanos Receive {{$labels.job}} has not been able to reload hashring configurations.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosReceiveNoUpload
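    # flags receive instances whose shipper has recorded no uploads to object storage over the last 3 hours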
    expr: (up{job=~".*thanos-receive.*"} - 1) + on (job, instance) (sum by (job, instance) (increase(thanos_shipper_uploads_total{job=~".*thanos-receive.*"}[3h])) == 0)
    for: 3h
    labels:
      severity: critical
    annotations:
      summary: Thanos Receive No Upload (instance {{ $labels.instance }})
      description: "Thanos Receive {{$labels.instance}} has not uploaded latest data to object storage.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosSidecarBucketOperationsFailed
    expr: sum by (job, instance) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-sidecar.*"}[5m])) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Thanos Sidecar Bucket Operations Failed (instance {{ $labels.instance }})
      description: "Thanos Sidecar {{$labels.instance}} bucket operations are failing\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosSidecarNoConnectionToStartedPrometheus
    expr: thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0 and on (namespace, pod)prometheus_tsdb_data_replay_duration_seconds != 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Thanos Sidecar No Connection To Started Prometheus (instance {{ $labels.instance }})
      description: "Thanos Sidecar {{$labels.instance}} is unhealthy.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosStoreGrpcErrorRate
    expr: (sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-store.*"}[5m]))/  sum by (job) (rate(grpc_server_started_total{job=~".*thanos-store.*"}[5m])) * 100 > 5)
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Thanos Store Grpc Error Rate (instance {{ $labels.instance }})
      description: "Thanos Store {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for store series gate requests. [copy]
  - alert: ThanosStoreSeriesGateLatencyHigh
    expr: (histogram_quantile(0.99, sum by (job, le) (rate(thanos_bucket_store_series_gate_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2 and sum by (job) (rate(thanos_bucket_store_series_gate_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0)
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Thanos Store Series Gate Latency High (instance {{ $labels.instance }})
      description: "Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for store series gate requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Store {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations. [copy]
  - alert: ThanosStoreBucketHighOperationFailures
    expr: (sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-store.*"}[5m])) / sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-store.*"}[5m])) * 100 > 5)
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Thanos Store Bucket High Operation Failures (instance {{ $labels.instance }})
      description: "Thanos Store {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Store {{$labels.job}} Bucket has a 99th percentile latency of {{$value}} seconds for the bucket operations. [copy]
  - alert: ThanosStoreObjstoreOperationLatencyHigh
    expr: (histogram_quantile(0.99, sum by (job, le) (rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2 and  sum by (job) (rate(thanos_objstore_bucket_operation_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0)
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Thanos Store Objstore Operation Latency High (instance {{ $labels.instance }})
      description: "Thanos Store {{$labels.job}} Bucket has a 99th percentile latency of {{$value}} seconds for the bucket operations.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosRuleQueueIsDroppingAlerts
    expr: sum by (job, instance) (rate(thanos_alert_queue_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Thanos Rule Queue Is Dropping Alerts (instance {{ $labels.instance }})
      description: "Thanos Rule {{$labels.instance}} is failing to queue alerts.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosRuleSenderIsFailingAlerts
    expr: sum by (job, instance) (rate(thanos_alert_sender_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Thanos Rule Sender Is Failing Alerts (instance {{ $labels.instance }})
      description: "Thanos Rule {{$labels.instance}} is failing to send alerts to alertmanager.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosRuleHighRuleEvaluationFailures
    expr: (sum by (job, instance) (rate(prometheus_rule_evaluation_failures_total{job=~".*thanos-rule.*"}[5m])) / sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m])) * 100 > 5)
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Thanos Rule High Rule Evaluation Failures (instance {{ $labels.instance }})
      description: "Thanos Rule {{$labels.instance}} is failing to evaluate rules.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosRuleHighRuleEvaluationWarnings
    expr: sum by (job, instance) (rate(thanos_rule_evaluation_with_warnings_total{job=~".*thanos-rule.*"}[5m])) > 0
    for: 15m
    labels:
      severity: info
    annotations:
      summary: Thanos Rule High Rule Evaluation Warnings (instance {{ $labels.instance }})
      description: "Thanos Rule {{$labels.instance}} has high number of evaluation warnings.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Rule {{$labels.instance}} has higher evaluation latency than interval for {{$labels.rule_group}}. [copy]
  - alert: ThanosRuleRuleEvaluationLatencyHigh
    expr: (sum by (job, instance, rule_group) (prometheus_rule_group_last_duration_seconds{job=~".*thanos-rule.*"}) > sum by (job, instance, rule_group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"}))
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Thanos Rule Rule Evaluation Latency High (instance {{ $labels.instance }})
      description: "Thanos Rule {{$labels.instance}} has higher evaluation latency than interval for {{$labels.rule_group}}.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosRuleGrpcErrorRate
    expr: (sum by (job, instance) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-rule.*"}[5m]))/  sum by (job, instance) (rate(grpc_server_started_total{job=~".*thanos-rule.*"}[5m])) * 100 > 5)
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Thanos Rule Grpc Error Rate (instance {{ $labels.instance }})
      description: "Thanos Rule {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosRuleConfigReloadFailure
    expr: avg by (job, instance) (thanos_rule_config_last_reload_successful{job=~".*thanos-rule.*"}) != 1
    for: 5m
    labels:
      severity: info
    annotations:
      summary: Thanos Rule Config Reload Failure (instance {{ $labels.instance }})
      description: "Thanos Rule {{$labels.job}} has not been able to reload its configuration.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for query endpoints. [copy]
  - alert: ThanosRuleQueryHighDNSFailures
    expr: (sum by (job, instance) (rate(thanos_rule_query_apis_dns_failures_total{job=~".*thanos-rule.*"}[5m])) / sum by (job, instance) (rate(thanos_rule_query_apis_dns_lookups_total{job=~".*thanos-rule.*"}[5m])) * 100 > 1)
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Thanos Rule Query High DNS Failures (instance {{ $labels.instance }})
      description: "Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for query endpoints.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints. [copy]
  - alert: ThanosRuleAlertmanagerHighDNSFailures
    expr: (sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_failures_total{job=~".*thanos-rule.*"}[5m])) / sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total{job=~".*thanos-rule.*"}[5m])) * 100 > 1)
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Thanos Rule Alertmanager High DNS Failures (instance {{ $labels.instance }})
      description: "Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Rule {{$labels.job}} has rule groups that did not evaluate for at least 10x of their expected interval. [copy]
  - alert: ThanosRuleNoEvaluationFor10Intervals
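    # compares the time since each rule group's last evaluation against 10x its configured interval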
    expr: time() -  max by (job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job=~".*thanos-rule.*"})>10 * max by (job, instance, group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"})
    for: 5m
    labels:
      severity: info
    annotations:
      summary: Thanos Rule No Evaluation For 10 Intervals (instance {{ $labels.instance }})
      description: "Thanos Rule {{$labels.job}} has rule groups that did not evaluate for at least 10x of their expected interval.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Rule {{$labels.instance}} did not perform any rule evaluations in the past 10 minutes. [copy]
  - alert: ThanosNoRuleEvaluations
    expr: sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m])) <= 0  and sum by (job, instance) (thanos_rule_loaded_rules{job=~".*thanos-rule.*"}) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Thanos No Rule Evaluations (instance {{ $labels.instance }})
      description: "Thanos Rule {{$labels.instance}} did not perform any rule evaluations in the past 10 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosBucketReplicateErrorRate
    expr: (sum by (job) (rate(thanos_replicate_replication_runs_total{result="error", job=~".*thanos-bucket-replicate.*"}[5m]))/ on (job) group_left  sum by (job) (rate(thanos_replicate_replication_runs_total{job=~".*thanos-bucket-replicate.*"}[5m]))) * 100 >= 10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Thanos Bucket Replicate Error Rate (instance {{ $labels.instance }})
      description: "Thanos Replicate is failing to run, {{$value | humanize}}% of attempts failed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for the replicate operations. [copy]
  - alert: ThanosBucketReplicateRunLatency
    expr: (histogram_quantile(0.99, sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~".*thanos-bucket-replicate.*"}[5m]))) > 20 and  sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~".*thanos-bucket-replicate.*"}[5m])) > 0)
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Thanos Bucket Replicate Run Latency (instance {{ $labels.instance }})
      description: "Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for the replicate operations.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ThanosCompactIsDown
    expr: absent(up{job=~".*thanos-compact.*"} == 1)
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Thanos Compact Is Down (instance {{ $labels.instance }})
      description: "ThanosCompact has disappeared. Prometheus target for the component cannot be discovered.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Query Is Down (critical): ThanosQuery has disappeared. Prometheus target for the component cannot be discovered.
  - alert: ThanosReceiveIsDown
    expr: absent(up{job=~".*thanos-receive.*"} == 1)
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Thanos Receive Is Down (instance {{ $labels.instance }})
      description: "ThanosReceive has disappeared. Prometheus target for the component cannot be discovered.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Rule Is Down (critical): ThanosRule has disappeared. Prometheus target for the component cannot be discovered.
  - alert: ThanosSidecarIsDown
    expr: absent(up{job=~".*thanos-sidecar.*"} == 1)
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Thanos Sidecar Is Down (instance {{ $labels.instance }})
      description: "ThanosSidecar has disappeared. Prometheus target for the component cannot be discovered.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Store Is Down (critical): ThanosStore has disappeared. Prometheus target for the component cannot be discovered.
  - alert: LokiProcessTooManyRestarts
    expr: changes(process_start_time_seconds{job=~".*loki.*"}[15m]) > 2
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Loki process too many restarts (instance {{ $labels.instance }})
      description: "A loki process had too many restarts (target {{ $labels.instance }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: LokiRequestErrors
    expr: 100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (namespace, job, route) / sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route) > 10
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: Loki request errors (instance {{ $labels.instance }})
      description: "The {{ $labels.job }} and {{ $labels.route }} are experiencing errors\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: LokiRequestPanic
    expr: sum(increase(loki_panic_total[10m])) by (namespace, job) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Loki request panic (instance {{ $labels.instance }})
      description: "The {{ $labels.job }} is experiencing {{ printf \"%.2f\" $value }}% increase of panics\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency [copy]
  - alert: LokiRequestLatency
    expr: (histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket{route!~"(?i).*tail.*"}[5m])) by (le)))  > 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Loki request latency (instance {{ $labels.instance }})
      description: "The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}s 99th percentile latency\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}% errors. [copy]
  - alert: PromtailRequestErrors
    expr: 100 * sum(rate(promtail_request_duration_seconds_count{status_code=~"5..|failed"}[1m])) by (namespace, job, route, instance) / sum(rate(promtail_request_duration_seconds_count[1m])) by (namespace, job, route, instance) > 10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Promtail request errors (instance {{ $labels.instance }})
      description: "The {{ $labels.job }} {{ $labels.route }} is




    
 experiencing {{ printf \"%.2f\" $value }}% errors.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency. [copy]
  - alert: PromtailRequestLatency
    expr: histogram_quantile(0.99, sum(rate(promtail_request_duration_seconds_bucket[5m])) by (le)) > 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Promtail request latency (instance {{ $labels.instance }})
      description: "The {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf \"%.2f\" $value }}s 99th percentile latency.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CortexRulerConfigurationReloadFailure
    expr: cortex_ruler_config_last_reload_successful != 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Cortex ruler configuration reload failure (instance {{ $labels.instance }})
      description: "Cortex ruler configuration reload failure (instance {{ $labels.instance }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CortexNotConnectedToAlertmanager
    expr: cortex_prometheus_notifications_alertmanagers_discovered < 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cortex not connected to Alertmanager (instance {{ $labels.instance }})
      description: "Cortex not connected to Alertmanager (instance {{ $labels.instance }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CortexNotificationAreBeingDropped
    expr: rate(cortex_prometheus_notifications_dropped_total[5m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cortex notifications are being dropped (instance {{ $labels.instance }})
      description: "Cortex notifications are being dropped due to errors (instance {{ $labels.instance }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CortexNotificationError
    expr: rate(cortex_prometheus_notifications_errors_total[5m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cortex notification error (instance {{ $labels.instance }})
      description: "Cortex is failing when sending alert notifications (instance {{ $labels.instance }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CortexIngesterUnhealthy
    expr: cortex_ring_members{state="Unhealthy", name="ingester"} > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cortex ingester unhealthy (instance {{ $labels.instance }})
      description: "Cortex has an unhealthy ingester\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: CortexFrontendQueriesStuck
    expr: sum by (job) (cortex_query_frontend_queue_length) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: Cortex frontend queries stuck (instance {{ $labels.instance }})
      description: "There are queued up queries in query-frontend.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Jenkins offline (critical): Jenkins offline: `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}}).
Jenkins healthcheck (critical): Jenkins healthcheck score: {{$value}}. Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}}).
  - alert: JenkinsOutdatedPlugins
    expr: sum(jenkins_plugins_withUpdate) by (instance) > 3
    for: 1d
    labels:
      severity: warning
    annotations:
      summary: Jenkins outdated plugins (instance {{ $labels.instance }})
      description: "{{ $value }} plugins need update\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}}) [copy]
  - alert: JenkinsBuildsHealthScore
    expr: default_jenkins_builds_health_score < 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Jenkins builds health score (instance {{ $labels.instance }})
      description: "Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Job run failures: ({{$value}}) {{$labels.jenkins_job}}. Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}}) [copy]
  - alert: JenkinsRunFailureTotal
    expr: delta(jenkins_runs_failure_total[1h]) > 100
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Jenkins run failure total (instance {{ $labels.instance }})
      description: "Job run failures: ({{$value}}) {{$labels.jenkins_job}}. Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Last build tests failed: {{$labels.jenkins_job}}. Failed build Tests for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}}) [copy]
  - alert: JenkinsBuildTestsFailing
    expr: default_jenkins_builds_last_build_tests_failing > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Jenkins build tests failing (instance {{ $labels.instance }})
      description: "Last build tests failed: {{$labels.jenkins_job}}. Failed build Tests for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Last build failed: {{$labels.jenkins_job}}. Failed build for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}}) [copy]
  # * RUNNING  -1 true  - The build had no errors.
  # * SUCCESS   0 true  - The build had no errors.
  # * UNSTABLE  1 true  - The build had some errors but they were not fatal. For example, some tests failed.
  # * FAILURE   2 false - The build had a fatal error.
  # * NOT_BUILT 3 false - The module was not built.
  # * ABORTED   4 false - The build was manually aborted.
  - alert: JenkinsLastBuildFailed
    expr: default_jenkins_builds_last_build_result_ordinal == 2
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Jenkins last build failed (instance {{ $labels.instance }})
      description: "Last build failed: {{$labels.jenkins_job}}. Failed build for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
APC UPS Battery nearly empty (critical): Battery is almost empty (< 10% left).
  - alert: ApcUpsLessThan15MinutesOfBatteryTimeRemaining
    expr: apcupsd_battery_time_left_seconds < 900
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: APC UPS Less than 15 Minutes of battery time remaining (instance {{ $labels.instance }})
      description: "Battery is almost empty (< 15 Minutes remaining)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
APC UPS AC input outage (warning): UPS now running on battery (since {{$value | humanizeDuration}}).
  - alert: ApcUpsLowBatteryVoltage
    expr: (apcupsd_battery_volts / apcupsd_battery_nominal_volts) < 0.95
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: APC UPS low battery voltage (instance {{ $labels.instance }})
      description: "Battery voltage is lower than nominal (< 95%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  - alert: ApcUpsHighTemperature
    expr: apcupsd_internal_temperature_celsius >= 40
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: APC UPS high temperature (instance {{ $labels.instance }})
      description: "Internal temperature is high ({{$value}}°C)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
APC UPS high load (warning): UPS load is > 80%.
Provider failed because net_version failed (critical): Failed net_version for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`.
Provider failed because get genesis failed (critical): Failed to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`.
Provider failed because net_version timeout (critical): net_version timeout for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`.
Provider failed because get genesis timeout (critical): Timeout to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`.
Store connection is too slow (warning): Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`.
Store connection is too slow (critical): Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`.