Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.

- alert: PrometheusTooManyRestarts
  expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Prometheus too many restarts (instance {{ $labels.instance }})
    description: "Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.

  labels:
    severity: critical
  annotations:
    summary: Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }})
    description: "Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
- alert: PrometheusNotConnectedToAlertmanager
  expr: prometheus_notifications_alertmanagers_discovered < 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus not connected to alertmanager (instance {{ $labels.instance }})
    description: "Prometheus cannot connect to the alertmanager\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.
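The rule body for this description is missing from the listing; a minimal sketch, assuming Prometheus' own prometheus_rule_evaluation_failures_total self-metric:

- alert: PrometheusRuleEvaluationFailures
  # Counts rule evaluation failures over the last 3 minutes
  expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Prometheus rule evaluation failures (instance {{ $labels.instance }})
    description: "Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"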
  labels:
    severity: critical
  annotations:
    summary: Prometheus target empty (instance {{ $labels.instance }})
    description: "Prometheus has no target in service discovery\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Prometheus is scraping exporters slowly since it exceeded the requested interval time. Your Prometheus server is under-provisioned.
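The rule body is missing here; a minimal sketch based on the prometheus_target_interval_length_seconds summary metric (the 1.05 ratio threshold is an assumption to tune):

- alert: PrometheusTargetScrapingSlow
  # Compares the 90th-percentile actual scrape interval against the median interval
  expr: prometheus_target_interval_length_seconds{quantile="0.9"} / on (interval, instance, job) prometheus_target_interval_length_seconds{quantile="0.5"} > 1.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Prometheus target scraping slow (instance {{ $labels.instance }})
    description: "Prometheus is scraping exporters slowly since it exceeded the requested interval time. Your Prometheus server is under-provisioned.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"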
Node memory is < 20% for 1 week. Consider reducing memory space. (instance {{ $labels.instance }})

# You may want to increase the alert manager 'repeat_interval' for this type of alert to daily or weekly
- alert: HostMemoryIsUnderutilized
  expr: (100 - (avg_over_time(node_memory_MemAvailable_bytes[30m]) / node_memory_MemTotal_bytes * 100) < 20) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
  for: 1w
  labels:
    severity: info
  annotations:
    summary: Host Memory is underutilized (instance {{ $labels.instance }})
    description: "Node memory is < 20% for 1 week. Consider reducing memory space. (instance {{ $labels.instance }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

# Please add ignored mountpoints in node_exporter parameters like
# "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
# Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
- alert: HostOutOfDiskSpace
  expr: ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host out of disk space (instance {{ $labels.instance }})
    description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Filesystem is predicted to run out of space within the next 24 hours at current write rate

# Please add ignored mountpoints in node_exporter parameters like
# "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
# Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
- alert: HostDiskWillFillIn24Hours
  expr: ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host disk will fill in 24 hours (instance {{ $labels.instance }})
    description: "Filesystem is predicted to run out of space within the next 24 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: HostOutOfInodes
  expr: (node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host out of inodes (instance {{ $labels.instance }})
    description: "Disk is almost running out of available inodes (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: critical
  annotations:
    summary: Host filesystem device error (instance {{ $labels.instance }})
    description: "{{ $labels.instance }}: Device error with the {{ $labels.mountpoint }} filesystem\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Filesystem is predicted to run out of inodes within the next 24 hours at current write rate

- alert: HostInodesWillFillIn24Hours
  expr: (node_filesystem_files_free{fstype!="msdosfs"} / node_filesystem_files{fstype!="msdosfs"} * 100 < 10 and predict_linear(node_filesystem_files_free{fstype!="msdosfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly{fstype!="msdosfs"} == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Host inodes will fill in 24 hours (instance {{ $labels.instance }})
    description: "Filesystem is predicted to run out of inodes within the next 24 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: HostHighCpuLoad
  expr: (sum by (instance) (avg by (mode, instance) (rate(node_cpu_seconds_total{mode!="idle"}[2m]))) > 0.8) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Host high CPU load (instance {{ $labels.instance }})
    description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

# You may want to increase the alert manager 'repeat_interval' for this type of alert to daily or weekly
- alert: HostCpuIsUnderutilized
  expr: (100 - (rate(node_cpu_seconds_total{mode="idle"}[30m]) * 100) < 20) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
  for: 1w
  labels:
    severity: info
  annotations:
    summary: Host CPU is underutilized (instance {{ $labels.instance }})
    description: "CPU load is < 20% for 1 week. Consider reducing the number of CPUs.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
CPU steal is > 10%. A noisy neighbor is killing VM performance or a spot instance may be out of credit.
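The rule body is missing here; a minimal sketch, assuming node_exporter's node_cpu_seconds_total counter with mode="steal":

- alert: HostCpuStealNoisyNeighbor
  # Fraction of CPU time stolen by the hypervisor, averaged per instance
  expr: avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host CPU steal noisy neighbor (instance {{ $labels.instance }})
    description: "CPU steal is > 10%. A noisy neighbor is killing VM performance or a spot instance may be out of credit.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"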
RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.
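The rule body is missing here; a minimal sketch, assuming node_exporter's mdadm collector and its node_md_state metric:

- alert: HostRaidArrayGotInactive
  # node_md_state reports 1 for the array's current state; "inactive" means the array is degraded/stopped
  expr: node_md_state{state="inactive"} > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Host RAID array got inactive (instance {{ $labels.instance }})
    description: "RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"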
# This rule can be very noisy in dynamic infra with legitimate container start/stop/deployment.
- alert: ContainerKilled
  expr: time() - container_last_seen > 60
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Container killed (instance {{ $labels.instance }})
    description: "A container has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

# This rule can be very noisy in dynamic infra with legitimate container start/stop/deployment.
- alert: ContainerAbsent
  expr: absent(container_last_seen)
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Container absent (instance {{ $labels.instance }})
    description: "A container is absent for 5 min\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ContainerHighCpuUtilization
  expr: (sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (pod, container) * 100) > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Container High CPU utilization (instance {{ $labels.instance }})
    description: "Container CPU utilization is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

# See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
- alert: ContainerHighMemoryUsage
  expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Container High Memory usage (instance {{ $labels.instance }})
    description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ContainerVolumeUsage
  expr: (1 - (sum(container_fs_inodes_free{name!=""}) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Container Volume usage (instance {{ $labels.instance }})
    description: "Container Volume usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ContainerHighThrottleRate
  expr: sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace) > ( 25 / 100 )
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Container high throttle rate (instance {{ $labels.instance }})
    description: "Container is being throttled\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ContainerLowCpuUtilization
  expr: (sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (pod, container) * 100) < 20
  for: 7d
  labels:
    severity: info
  annotations:
    summary: Container Low CPU utilization (instance {{ $labels.instance }})
    description: "Container CPU utilization is under 20% for 1 week. Consider reducing the allocated CPU.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

# For probe_ssl_earliest_cert_expiry to be exposed after expiration, you
# need to enable insecure_skip_verify. Note that this will disable
# certificate validation.
# See https://github.com/prometheus/blackbox_exporter/blob/master/CONFIGURATION.md#tls_config
- alert: BlackboxSslCertificateExpired
  expr: round((last_over_time(probe_ssl_earliest_cert_expiry[10m]) - time()) / 86400, 0.1) < 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Blackbox SSL certificate expired (instance {{ $labels.instance }})
    description: "SSL certificate has expired already\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: WindowsServerCollectorError
  expr: windows_exporter_collector_success == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Windows Server collector Error (instance {{ $labels.instance }})
    description: "Collector {{ $labels.collector }} was not successful\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: WindowsServerServiceStatus
  expr: windows_service_status{status="ok"} != 1
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Windows Server service Status (instance {{ $labels.instance }})
    description: "Windows Service state is not OK\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: WindowsServerCpuUsage
  expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100) > 80
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Windows Server CPU Usage (instance {{ $labels.instance }})
    description: "CPU Usage is more than 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: info
  annotations:
    summary: MySQL restarted (instance {{ $labels.instance }})
    description: "MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  labels:
    severity: critical
  annotations:
    summary: Postgresql down (instance {{ $labels.instance }})
    description: "Postgresql instance is down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  labels:
    severity: critical
  annotations:
    summary: Postgresql exporter error (instance {{ $labels.instance }})
    description: "Postgresql exporter is showing errors. A query may be buggy in query.yaml\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
- alert: PostgresqlTableNotAutoVacuumed
  expr: (pg_stat_user_tables_last_autovacuum > 0) and (time() - pg_stat_user_tables_last_autovacuum) > 60 * 60 * 24 * 10
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Postgresql table not auto vacuumed (instance {{ $labels.instance }})
    description: "Table {{ $labels.relname }} has not been auto vacuumed for 10 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: PostgresqlTableNotAutoAnalyzed
  expr: (pg_stat_user_tables_last_autoanalyze > 0) and (time() - pg_stat_user_tables_last_autoanalyze) > 24 * 60 * 60 * 10
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Postgresql table not auto analyzed (instance {{ $labels.instance }})
    description: "Table {{ $labels.relname }} has not been auto analyzed for 10 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: PostgresqlTooManyConnections
  expr: sum by (instance, job, server) (pg_stat_activity_count) > min by (instance, job, server) (pg_settings_max_connections * 0.8)
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Postgresql too many connections (instance {{ $labels.instance }})
    description: "PostgreSQL instance has too many connections (> 80%).\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: PostgresqlNotEnoughConnections
  expr: sum by (datname) (pg_stat_activity_count{datname!~"template.*|postgres"}) < 5
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Postgresql not enough connections (instance {{ $labels.instance }})
    description: "PostgreSQL instance should have more connections (> 5)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: PostgresqlDeadLocks
  expr: increase(pg_stat_database_deadlocks{datname!~"template.*|postgres"}[1m]) > 5
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Postgresql dead locks (instance {{ $labels.instance }})
    description: "PostgreSQL has dead-locks\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: PostgresqlHighRollbackRate
  expr: sum by (namespace,datname) ((rate(pg_stat_database_xact_rollback{datname!~"template.*|postgres",datid!="0"}[3m])) / ((rate(pg_stat_database_xact_rollback{datname!~"template.*|postgres",datid!="0"}[3m])) + (rate(pg_stat_database_xact_commit{datname!~"template.*|postgres",datid!="0"}[3m])))) > 0.02
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Postgresql high rollback rate (instance {{ $labels.instance }})
    description: "Ratio of transactions being aborted compared to committed is > 2%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Database connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.

- alert: PostgresqlSslCompressionActive
  expr: sum(pg_stat_ssl_compression) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Postgresql SSL compression active (instance {{ $labels.instance }})
    description: "Database connections with SSL compression enabled. This may add significant jitter in replication delay. Replicas should turn off SSL compression via `sslcompression=0` in `recovery.conf`.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.

- alert: PostgresqlTooManyLocksAcquired
  expr: ((sum (pg_locks_count)) / (pg_settings_max_locks_per_transaction * pg_settings_max_connections)) > 0.20
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Postgresql too many locks acquired (instance {{ $labels.instance }})
    description: "Too many locks acquired on the database. If this alert happens frequently, we may need to increase the postgres setting max_locks_per_transaction.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

The index {{ $labels.idxname }} is bloated. You should execute `REINDEX INDEX CONCURRENTLY {{ $labels.idxname }};`

# See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737
- alert: PostgresqlBloatIndexHigh(> 80%)
  expr: pg_bloat_btree_bloat_pct > 80 and on (idxname) (pg_bloat_btree_real_size > 100000000)
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: Postgresql bloat index high (> 80%) (instance {{ $labels.instance }})
    description: "The index {{ $labels.idxname }} is bloated. You should execute `REINDEX INDEX CONCURRENTLY {{ $labels.idxname }};`\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

The table {{ $labels.relname }} is bloated. You should execute `VACUUM {{ $labels.relname }};`

# See https://github.com/samber/awesome-prometheus-alerts/issues/289#issuecomment-1164842737
- alert: PostgresqlBloatTableHigh(> 80%)
  expr: pg_bloat_table_bloat_pct > 80 and on (relname) (pg_bloat_table_real_size > 200000000)
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: Postgresql bloat table high (> 80%) (instance {{ $labels.instance }})
    description: "The table {{ $labels.relname }} is bloated. You should execute `VACUUM {{ $labels.relname }};`\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

The table {{ $labels.relname }} has an invalid index: {{ $labels.indexrelname }}. You should execute `DROP INDEX {{ $labels.indexrelname }};`

- alert: PostgresqlInvalidIndex
  expr: pg_genaral_index_info_pg_relation_size{indexrelname=~".*ccnew.*"}
  for: 6h
  labels:
    severity: warning
  annotations:
    summary: Postgresql invalid index (instance {{ $labels.instance }})
    description: "The table {{ $labels.relname }} has an invalid index: {{ $labels.indexrelname }}. You should execute `DROP INDEX {{ $labels.indexrelname }};`\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: critical
  annotations:
    summary: SQL Server down (instance {{ $labels.instance }})
    description: "SQL server instance is down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  labels:
    severity: warning
  annotations:
    summary: SQL Server deadlock (instance {{ $labels.instance }})
    description: "SQL Server is having some deadlock.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
A leader node (neither primary nor standby) cannot be found inside the cluster {{ $labels.scope }}

- alert: PatroniHasNoLeader
  expr: (max by (scope) (patroni_master) < 1) and (max by (scope) (patroni_standby_leader) < 1)
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Patroni has no Leader (instance {{ $labels.instance }})
    description: "A leader node (neither primary nor standby) cannot be found inside the cluster {{ $labels.scope }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: PgbouncerActiveConnections
  expr: pgbouncer_pools_server_active_connections > 200
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: PGBouncer active connections (instance {{ $labels.instance }})
    description: "PGBouncer pools are filling up\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
PGBouncer is logging errors. This may be due to a server restart or an admin typing commands at the pgbouncer console.
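The rule body is missing here; a minimal sketch, assuming the prometheus-pgbouncer-exporter pgbouncer_errors_count metric with its errmsg label (the threshold is an assumption):

- alert: PgbouncerErrors
  # Ignore the benign "server conn crashed?" message that appears on server restarts
  expr: increase(pgbouncer_errors_count{errmsg!="server conn crashed?"}[1m]) > 10
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: PGBouncer errors (instance {{ $labels.instance }})
    description: "PGBouncer is logging errors. This may be due to a server restart or an admin typing commands at the pgbouncer console.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"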
- alert: RedisTooManyMasters
  expr: count(redis_instance_info{role="master"}) > 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Redis too many masters (instance {{ $labels.instance }})
    description: "Redis cluster has too many nodes marked as master.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: RedisDisconnectedSlaves
  expr: count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Redis disconnected slaves (instance {{ $labels.instance }})
    description: "Redis not replicating for all slaves. Consider reviewing the redis replication status.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: critical
  annotations:
    summary: Redis replication broken (instance {{ $labels.instance }})
    description: "Redis instance lost a slave\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Changes have been detected in Redis replica connection. This can occur when replica nodes lose connection to the master and reconnect (a.k.a flapping).

  labels:
    severity: critical
  annotations:
    summary: Redis cluster flapping (instance {{ $labels.instance }})
    description: "Changes have been detected in Redis replica connection. This can occur when replica nodes lose connection to the master and reconnect (a.k.a flapping).\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
# The exporter must be started with --include-system-metrics flag or REDIS_EXPORTER_INCL_SYSTEM_METRICS=true environment variable.
- alert: RedisOutOfSystemMemory
  expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Redis out of system memory (instance {{ $labels.instance }})
    description: "Redis is running out of system memory (> 90%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: RedisOutOfConfiguredMaxmemory
  expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 90 and on(instance) redis_memory_max_bytes > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Redis out of configured maxmemory (instance {{ $labels.instance }})
    description: "Redis is running out of configured maxmemory (> 90%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: RedisTooManyConnections
  expr: redis_connected_clients / redis_config_maxclients * 100 > 90
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Redis too many connections (instance {{ $labels.instance }})
    description: "Redis is running out of connections (> 90% used)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: warning
  annotations:
    summary: Redis not enough connections (instance {{ $labels.instance }})
    description: "Redis instance should have more connections (> 5)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
MongoDB Replication set member either performs startup self-checks, or transitions from completing a rollback or resync

  labels:
    severity: critical
  annotations:
    summary: MongoDB replication Status 3 (instance {{ $labels.instance }})
    description: "MongoDB Replication set member either performs startup self-checks, or transitions from completing a rollback or resync\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: critical
  annotations:
    summary: MongoDB replication Status 6 (instance {{ $labels.instance }})
    description: "MongoDB Replication set member, as seen from another member of the set, is not yet known\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: critical
  annotations:
    summary: MongoDB replication Status 8 (instance {{ $labels.instance }})
    description: "MongoDB Replication set member, as seen from another member of the set, is unreachable\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
MongoDB Replication set member is actively performing a rollback. Data is not available for reads

  labels:
    severity: critical
  annotations:
    summary: MongoDB replication Status 9 (instance {{ $labels.instance }})
    description: "MongoDB Replication set member is actively performing a rollback. Data is not available for reads\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
- alert: MongodbReplicationStatus10
  expr: mongodb_replset_member_state == 10
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: MongoDB replication Status 10 (instance {{ $labels.instance }})
    description: "MongoDB Replication set member was once in a replica set but was subsequently removed\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: MongodbNumberCursorsOpen
  expr: mongodb_metrics_cursor_open{state="total_open"} > 10000
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: MongoDB number cursors open (instance {{ $labels.instance }})
    description: "Too many cursors opened by MongoDB for clients (> 10k)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: RabbitmqTooManyUnackMessages
  expr: sum(rabbitmq_queue_messages_unacked) BY (queue) > 1000
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: RabbitMQ too many unack messages (instance {{ $labels.instance }})
    description: "Too many unacknowledged messages\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: warning
  annotations:
    summary: RabbitMQ too many connections (instance {{ $labels.instance }})
    description: "The total number of connections on a node is too high\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: warning
  annotations:
    summary: RabbitMQ no queue consumer (instance {{ $labels.instance }})
    description: "A queue has less than 1 consumer\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
- alert: RabbitmqOutOfMemory
  expr: rabbitmq_node_mem_used / rabbitmq_node_mem_limit * 100 > 90
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: RabbitMQ out of memory (instance {{ $labels.instance }})
    description: "Memory available for RabbitMQ is low (< 10%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: warning
  annotations:
    summary: RabbitMQ too many connections (instance {{ $labels.instance }})
    description: "RabbitMQ instance has too many connections (> 1000)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
# Indicate the queue name in dedicated label.
- alert: RabbitmqDeadLetterQueueFillingUp
  expr: rabbitmq_queue_messages{queue="my-dead-letter-queue"} > 10
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: RabbitMQ dead letter queue filling up (instance {{ $labels.instance }})
    description: "Dead letter queue is filling up (> 10 msgs)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

# Indicate the queue name in dedicated label.
- alert: RabbitmqTooManyMessagesInQueue
  expr: rabbitmq_queue_messages_ready{queue="my-queue"} > 1000
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: RabbitMQ too many messages in queue (instance {{ $labels.instance }})
    description: "Queue is filling up (> 1000 msgs)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

# Indicate the queue name in dedicated label.
- alert: RabbitmqSlowQueueConsuming
  expr: time() - rabbitmq_queue_head_message_timestamp{queue="my-queue"} > 60
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: RabbitMQ slow queue consuming (instance {{ $labels.instance }})
    description: "Queue messages are consumed slowly (> 60s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: critical
  annotations:
    summary: RabbitMQ no consumer (instance {{ $labels.instance }})
    description: "Queue has no consumer\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
# Indicate the queue name in dedicated label.
- alert: RabbitmqTooManyConsumers
  expr: rabbitmq_queue_consumers{queue="my-queue"} > 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: RabbitMQ too many consumers (instance {{ $labels.instance }})
    description: "Queue should have only 1 consumer\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

# Indicate the exchange name in dedicated label.
- alert: RabbitmqUnactiveExchange
  expr: rate(rabbitmq_exchange_messages_published_in_total{exchange="my-exchange"}[1m]) < 5
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: RabbitMQ unactive exchange (instance {{ $labels.instance }})
    description: "Exchange receives less than 5 msgs per second\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ElasticsearchHeapUsageTooHigh
  expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) * 100 > 90
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Elasticsearch Heap Usage Too High (instance {{ $labels.instance }})
    description: "The heap usage is over 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ElasticsearchDiskOutOfSpace
  expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 10
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Elasticsearch disk out of space (instance {{ $labels.instance }})
    description: "The disk usage is over 90%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ElasticsearchDiskSpaceLow
  expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes * 100 < 20
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Elasticsearch disk space low (instance {{ $labels.instance }})
    description: "The disk usage is over 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ElasticsearchClusterRed
  expr: elasticsearch_cluster_health_status{color="red"} == 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Elasticsearch Cluster Red (instance {{ $labels.instance }})
    description: "Elastic Cluster Red status\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ElasticsearchNoNewDocuments
  expr: increase(elasticsearch_indices_indexing_index_total{es_data_node="true"}[10m]) < 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Elasticsearch no new documents (instance {{ $labels.instance }})
    description: "No new documents for 10 min!\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.

A lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.
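The rule bodies for these two descriptions are missing; a rough sketch for the write case, assuming the criteo cassandra_exporter naming, which flattens MBean paths into the name label of cassandra_stats (verify the metric against your exporter; swap "write" for "read" for the read-failure variant):

- alert: CassandraClientRequestWriteFailure
  # Metric name is an assumption tied to the criteo cassandra_exporter
  expr: increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:failures:count"}[1m]) > 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Cassandra client request write failure (instance {{ $labels.instance }})
    description: "A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"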
- alert: ClickhouseNoAvailableReplicas
  expr: ClickHouseErrorMetric_NO_AVAILABLE_REPLICA == 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: ClickHouse No Available Replicas (instance {{ $labels.instance }})
    description: "No available replicas in ClickHouse.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ClickhouseNoLiveReplicas
  expr: ClickHouseErrorMetric_TOO_FEW_LIVE_REPLICAS == 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: ClickHouse No Live Replicas (instance {{ $labels.instance }})
    description: "There are too few live replicas available, risking data loss and service disruption.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

# Please replace the threshold with an appropriate value
- alert: ClickhouseHighNetworkTraffic
  expr: ClickHouseMetrics_NetworkSend > 250 or ClickHouseMetrics_NetworkReceive > 250
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: ClickHouse High Network Traffic (instance {{ $labels.instance }})
    description: "Network traffic is unusually high, may affect cluster performance.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

# Please replace the threshold with an appropriate value
- alert: ClickhouseHighTcpConnections
  expr: ClickHouseMetrics_TCPConnection > 400
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: ClickHouse High TCP Connections (instance {{ $labels.instance }})
    description: "High number of TCP connections, indicating heavy client or inter-cluster communication.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
An increase in interserver connections may indicate replication or distributed query handling issues.
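The rule body is missing here; a minimal sketch, assuming ClickHouse's embedded Prometheus endpoint exposes the InterserverConnection current metric as ClickHouseMetrics_InterserverConnection (the threshold is an assumption):

# Please replace the threshold with an appropriate value
- alert: ClickhouseHighInterserverConnections
  expr: ClickHouseMetrics_InterserverConnection > 50
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: ClickHouse High Interserver Connections (instance {{ $labels.instance }})
    description: "An increase in interserver connections may indicate replication or distributed query handling issues.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"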
- alert: PulsarSubscriptionHighNumberOfBacklogEntries
  expr: sum(pulsar_subscription_back_log) by (subscription) > 5000
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: Pulsar subscription high number of backlog entries (instance {{ $labels.instance }})
    description: "The number of subscription backlog entries is over 5k\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: PulsarSubscriptionVeryHighNumberOfBacklogEntries
  expr: sum(pulsar_subscription_back_log) by (subscription) > 100000
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: Pulsar subscription very high number of backlog entries (instance {{ $labels.instance }})
    description: "The number of subscription backlog entries is over 100k\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: PulsarTopicLargeBacklogStorageSize
  expr: sum(pulsar_storage_size > 5*1024*1024*1024) by (topic)
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: Pulsar topic large backlog storage size (instance {{ $labels.instance }})
    description: "The topic backlog storage size is over 5 GB\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: PulsarTopicVeryLargeBacklogStorageSize
  expr: sum(pulsar_storage_size > 20*1024*1024*1024) by (topic)
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: Pulsar topic very large backlog storage size (instance {{ $labels.instance }})
    description: "The topic backlog storage size is over 20 GB\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: PulsarHighWriteLatency
  expr: sum(pulsar_storage_write_latency_overflow > 0) by (topic)
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: Pulsar high write latency (instance {{ $labels.instance }})
    description: "Messages cannot be written in a timely fashion\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: PulsarLargeMessagePayload
  expr: sum(pulsar_entry_size_overflow > 0) by (topic)
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: Pulsar large message payload (instance {{ $labels.instance }})
    description: "Observing large message payload (> 1 MB)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: PulsarHighLedgerDiskUsage
  expr: sum(bookie_ledger_dir__pulsar_data_bookkeeper_ledgers_usage) by (kubernetes_pod_name) > 75
  for: 1h
  labels:
    severity: critical
  annotations:
    summary: Pulsar high ledger disk usage (instance {{ $labels.instance }})
    description: "Observing Ledger Disk Usage (> 75%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: PulsarReadOnlyBookies
  expr: count(bookie_SERVER_STATUS{} == 0) by (pod)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Pulsar read only bookies (instance {{ $labels.instance }})
    description: "Observing Read only Bookies\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: PulsarHighNumberOfFunctionErrors
  expr: sum((rate(pulsar_function_user_exceptions_total{}[1m]) + rate(pulsar_function_system_exceptions_total{}[1m])) > 10) by (name)
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Pulsar high number of function errors (instance {{ $labels.instance }})
    description: "Observing more than 10 Function errors per minute\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: PulsarHighNumberOfSinkErrors
  expr: sum(rate(pulsar_sink_sink_exceptions_total{}[1m]) > 10) by (name)
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Pulsar high number of sink errors (instance {{ $labels.instance }})
    description: "Observing more than 10 Sink errors per minute\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: warning
  annotations:
    summary: Nats high connection count (instance {{ $labels.instance }})
    description: "High number of NATS connections ({{ $value }}) for {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: warning
  annotations:
    summary: Nats high pending bytes (instance {{ $labels.instance }})
    description: "High number of NATS pending bytes ({{ $value }}) for {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: warning
  annotations:
    summary: Nats high subscriptions count (instance {{ $labels.instance }})
    description: "High number of NATS subscriptions ({{ $value }}) for {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: warning
  annotations:
    summary: Nats high routes count (instance {{ $labels.instance }})
    description: "High number of NATS routes ({{ $value }}) for {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Solr collection {{ $labels.collection }} has failed updates for replica {{ $labels.replica }} on {{ $labels.base_url }}.
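The rule body for the failed-updates description is missing; a rough sketch only — the metric name below is an assumption about the Solr exporter and should be checked against your exporter's output:

- alert: SolrUpdateErrors
  # Metric name is a guess; verify against the metrics your Solr exporter actually exposes
  expr: increase(solr_metrics_core_update_handler_errors_total[1m]) > 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Solr update errors (instance {{ $labels.instance }})
    description: "Solr collection {{ $labels.collection }} has failed updates for replica {{ $labels.replica }} on {{ $labels.base_url }}.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"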
Solr collection {{ $labels.collection }} has less than two live nodes for replica {{ $labels.replica }} on {{ $labels.base_url }}.

  labels:
    severity: critical
  annotations:
    summary: Solr low live node count (instance {{ $labels.instance }})
    description: "Solr collection {{ $labels.collection }} has less than two live nodes for replica {{ $labels.replica }} on {{ $labels.base_url }}.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: critical
  annotations:
    summary: Hadoop Name Node Down (instance {{ $labels.instance }})
    description: "The Hadoop NameNode service is unavailable.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
- alert: HadoopResourceManagerDown
  expr: up{job="hadoop-resourcemanager"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Hadoop Resource Manager Down (instance {{ $labels.instance }})
    description: "The Hadoop ResourceManager service is unavailable.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: HadoopDataNodeOutOfService
  expr: hadoop_datanode_last_heartbeat == 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Hadoop Data Node Out Of Service (instance {{ $labels.instance }})
    description: "The Hadoop DataNode is not sending heartbeats.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: HadoopHdfsDiskSpaceLow
  expr: (hadoop_hdfs_bytes_total - hadoop_hdfs_bytes_used) / hadoop_hdfs_bytes_total < 0.1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Hadoop HDFS Disk Space Low (instance {{ $labels.instance }})
    description: "Available HDFS disk space is running low.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: warning
  annotations:
    summary: Hadoop HBase Region Count High (instance {{ $labels.instance }})
    description: "The HBase cluster has an unusually high number of regions.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
- alert: HadoopHbaseRegionServerHeapLow
  expr: hadoop_hbase_region_server_heap_bytes / hadoop_hbase_region_server_max_heap_bytes < 0.2
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: Hadoop HBase Region Server Heap Low (instance {{ $labels.instance }})
    description: "HBase RegionServers are running low on heap space.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: NginxLatencyHigh
  expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[2m])) by (host, node, le)) > 3
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Nginx latency high (instance {{ $labels.instance }})
    description: "Nginx p99 latency is higher than 3 seconds\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: critical
  annotations:
    summary: Apache down (instance {{ $labels.instance }})
    description: "Apache down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Apache workers in busy state approach the max workers count: 80% of workers are busy on {{ $labels.instance }}

- alert: ApacheWorkersLoad
  expr: (sum by (instance) (apache_workers{state="busy"}) / sum by (instance) (apache_scoreboard) ) * 100 > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Apache workers load (instance {{ $labels.instance }})
    description: "Apache workers in busy state approach the max workers count: 80% of workers are busy on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: warning
  annotations:
    summary: Apache restart (instance {{ $labels.instance }})
    description: "Apache has just been restarted.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}

- alert: HaproxyHighHttp4xxErrorRateBackend
  expr: ((sum by (proxy) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (proxy) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }})
    description: "Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}

- alert: HaproxyHighHttp5xxErrorRateBackend
  expr: ((sum by (proxy) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (proxy) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }})
    description: "Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: HaproxyHighHttp4xxErrorRateServer
  expr: ((sum by (server) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: HAProxy high HTTP 4xx error rate server (instance {{ $labels.instance }})
    description: "Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: HaproxyHighHttp5xxErrorRateServer
  expr: ((sum by (server) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100) > 5
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: HAProxy high HTTP 5xx error rate server (instance {{ $labels.instance }})
    description: "Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: HaproxyServerResponseErrors
  expr: (sum by (server) (rate(haproxy_server_response_errors_total[1m])) / sum by (server) (rate(haproxy_server_http_responses_total[1m]))) * 100 > 5
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: HAProxy server response errors (instance {{ $labels.instance }})
    description: "Too many response errors to {{ $labels.server }} server (> 5%).\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high.
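The rule body for this description is missing; a minimal sketch, assuming the haproxy_backend_connection_errors_total counter (label names vary between HAProxy exporter versions):

- alert: HaproxyBackendConnectionErrors
  # Connection error rate per backend over the last minute
  expr: (sum by (backend) (rate(haproxy_backend_connection_errors_total[1m]))) > 100
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: HAProxy backend connection errors (instance {{ $labels.instance }})
    description: "Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"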
- alert: HaproxyServerHealthcheckFailure
  expr: increase(haproxy_server_check_failures_total[1m]) > 0
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: HAProxy server healthcheck failure (instance {{ $labels.instance }})
    description: "Some server healthchecks are failing on {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: critical
  annotations:
    summary: HAProxy down (instance {{ $labels.instance }})
    description: "HAProxy down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}

- alert: HaproxyHighHttp4xxErrorRateBackend
  expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 5
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }})
    description: "Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}

- alert: HaproxyHighHttp5xxErrorRateBackend
  expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total[1m])) > 5
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }})
    description: "Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.fqdn }}/{{ $labels.backend }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: HaproxyHighHttp4xxErrorRateServer
  expr: sum by (server) (rate(haproxy_server_http_responses_total{code="4xx"}[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: HAProxy high HTTP 4xx error rate server (instance {{ $labels.instance }})
    description: "Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: HaproxyHighHttp5xxErrorRateServer
  expr: sum by (server) (rate(haproxy_server_http_responses_total{code="5xx"}[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: HAProxy high HTTP 5xx error rate server (instance {{ $labels.instance }})
    description: "Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: HaproxyServerResponseErrors
  expr: sum by (server) (rate(haproxy_server_response_errors_total[1m]) * 100) / sum by (server) (rate(haproxy_server_http_responses_total[1m])) > 5
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: HAProxy server response errors (instance {{ $labels.instance }})
    description: "Too many response errors to {{ $labels.server }} server (> 5%).\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 100 req/s). Request throughput may be too high.
- alert: HaproxyBackendMaxActiveSession
  expr: ((sum by (backend) (avg_over_time(haproxy_backend_current_sessions[2m]) * 100) / sum by (backend) (avg_over_time(haproxy_backend_limit_sessions[2m])))) > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: HAProxy backend max active session (instance {{ $labels.instance }})
    description: "HAproxy backend {{ $labels.fqdn }}/{{ $labels.backend }} is reaching session limit (> 80%).\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: HaproxyServerHealthcheckFailure
  expr: increase(haproxy_server_check_failures_total[1m]) > 0
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: HAProxy server healthcheck failure (instance {{ $labels.instance }})
    description: "Some server healthchecks are failing on {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: TraefikServiceDown
  expr: count(traefik_service_server_up) by (service) == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Traefik service down (instance {{ $labels.instance }})
    description: "All Traefik services are down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: TraefikHighHttp4xxErrorRateService
  expr: sum(rate(traefik_service_requests_total{code=~"4.*"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Traefik high HTTP 4xx error rate service (instance {{ $labels.instance }})
    description: "Traefik service 4xx error rate is above 5%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: TraefikHighHttp5xxErrorRateService
  expr: sum(rate(traefik_service_requests_total{code=~"5.*"}[3m])) by (service) / sum(rate(traefik_service_requests_total[3m])) by (service) * 100 > 5
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Traefik high HTTP 5xx error rate service (instance {{ $labels.instance }})
    description: "Traefik service 5xx error rate is above 5%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: TraefikBackendDown
  expr: count(traefik_backend_server_up) by (backend) == 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Traefik backend down (instance {{ $labels.instance }})
    description: "All Traefik backends are down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: TraefikHighHttp4xxErrorRateBackend
  expr: sum(rate(traefik_backend_requests_total{code=~"4.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Traefik high HTTP 4xx error rate backend (instance {{ $labels.instance }})
    description: "Traefik backend 4xx error rate is above 5%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: TraefikHighHttp5xxErrorRateBackend
  expr: sum(rate(traefik_backend_requests_total{code=~"5.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Traefik high HTTP 5xx error rate backend (instance {{ $labels.instance }})
    description: "Traefik backend 5xx error rate is above 5%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: JvmMemoryFillingUp
  expr: (sum by (instance)(jvm_memory_used_bytes{area="heap"}) / sum by (instance)(jvm_memory_max_bytes{area="heap"})) * 100 > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: JVM memory filling up (instance {{ $labels.instance }})
    description: "JVM memory is filling up (> 80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: warning
  annotations:
    summary: Sidekiq queue size (instance {{ $labels.instance }})
    description: "Sidekiq queue {{ $labels.name }} is growing\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Sidekiq jobs are taking more than 1min to be picked up. Users may be seeing delays in background processing.

- alert: SidekiqSchedulingLatencyTooHigh
  expr: max(sidekiq_queue_latency) > 60
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Sidekiq scheduling latency too high (instance {{ $labels.instance }})
    description: "Sidekiq jobs are taking more than 1min to be picked up. Users may be seeing delays in background processing.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
- alert: KubernetesNodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: Kubernetes Node not ready (instance {{ $labels.instance }})
    description: "Node {{ $labels.node }} has been unready for a long time\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: KubernetesNodeOutOfPodCapacity
  expr: sum by (node) ((kube_pod_status_phase{phase="Running"} == 1) + on(uid) group_left(node) (0 * kube_pod_info{pod_template_hash=""})) / sum by (node) (kube_node_status_allocatable{resource="pods"}) * 100 > 90
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes Node out of pod capacity (instance {{ $labels.instance }})
    description: "Node {{ $labels.node }} is out of pod capacity\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.
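The rule body is missing here; a minimal sketch using kube-state-metrics' restart counter together with the last-terminated reason:

- alert: KubernetesContainerOomKiller
  # Restart seen in the last 10 minutes AND the last termination reason was OOMKilled
  expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes Container oom killer (instance {{ $labels.instance }})
    description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"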
- alert: KubernetesVolumeOutOfDiskSpace
  expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes Volume out of disk space (instance {{ $labels.instance }})
    description: "Volume is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Volume under {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available.

- alert: KubernetesVolumeFullInFourDays
  expr: predict_linear(kubelet_volume_stats_available_bytes[6h:5m], 4 * 24 * 3600) < 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Kubernetes Volume full in four days (instance {{ $labels.instance }})
    description: "Volume under {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has hit maximum number of desired pods

- alert: KubernetesHpaScaleMaximum
  expr: (kube_horizontalpodautoscaler_status_desired_replicas >= kube_horizontalpodautoscaler_spec_max_replicas) and (kube_horizontalpodautoscaler_spec_max_replicas > 1) and (kube_horizontalpodautoscaler_spec_min_replicas != kube_horizontalpodautoscaler_spec_max_replicas)
  for: 2m
  labels:
    severity: info
  annotations:
    summary: Kubernetes HPA scale maximum (instance {{ $labels.instance }})
    description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has hit maximum number of desired pods\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is constantly at minimum replicas for 50% of the time. Potential cost saving here.
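The rule body is missing here; a minimal sketch comparing the median desired replica count over a day against the configured minimum (the 1d window and the > 3 floor are assumptions to tune):

- alert: KubernetesHpaUnderutilized
  # Median desired replicas over the last day equals the configured minimum, i.e. at minimum at least 50% of the time
  expr: max(quantile_over_time(0.5, kube_horizontalpodautoscaler_status_desired_replicas[1d]) == kube_horizontalpodautoscaler_spec_min_replicas) by (horizontalpodautoscaler) > 3
  for: 0m
  labels:
    severity: info
  annotations:
    summary: Kubernetes HPA underutilized (instance {{ $labels.instance }})
    description: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} is constantly at minimum replicas for 50% of the time. Potential cost saving here.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"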
Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 15 minutes.

- alert: KubernetesPodNotHealthy
  expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
  for: 15m
  labels:
    severity: critical
  annotations:
    summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
    description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 15 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: KubernetesPodCrashLooping
  expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
    description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.

# Threshold should be customized for each cronjob name.
- alert: KubernetesCronjobTooLong
  expr: time() - kube_cronjob_next_schedule_time > 3600
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes CronJob too long (instance {{ $labels.instance }})
    description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: KubernetesApiServerErrors
  expr: sum(rate(apiserver_request_total{job="apiserver",code=~"(?:5..)"}[1m])) by (instance, job) / sum(rate(apiserver_request_total{job="apiserver"}[1m])) by (instance, job) * 100 > 3
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Kubernetes API server errors (instance {{ $labels.instance }})
    description: "Kubernetes API server is experiencing high error rate\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: KubernetesApiClientErrors
  expr: (sum(rate(rest_client_requests_total{code=~"(4|5).."}[1m])) by (instance, job) / sum(rate(rest_client_requests_total[1m])) by (instance, job)) * 100 > 1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Kubernetes API client errors (instance {{ $labels.instance }})
    description: "Kubernetes API client is experiencing high error rate\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: KubernetesClientCertificateExpiresNextWeek
  expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 7*24*60*60
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes client certificate expires next week (instance {{ $labels.instance }})
    description: "A client certificate used to authenticate to the apiserver is expiring next week.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
A client certificate used to authenticate to the apiserver is expiring in less than 24.0 hours.

- alert: KubernetesClientCertificateExpiresSoon
  expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 24*60*60
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Kubernetes client certificate expires soon (instance {{ $labels.instance }})
    description: "A client certificate used to authenticate to the apiserver is expiring in less than 24.0 hours.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Kubernetes API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.

- alert: KubernetesApiServerLatency
  expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!~"(?:CONNECT|WATCHLIST|WATCH|PROXY)"} [10m])) WITHOUT (subresource)) > 1
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes API server latency (instance {{ $labels.instance }})
    description: "Kubernetes API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ConsulServiceHealthcheckFailed
  expr: consul_catalog_service_node_healthy == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Consul service healthcheck failed (instance {{ $labels.instance }})
    description: "Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}`\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: critical
  annotations:
    summary: Consul missing master node (instance {{ $labels.instance }})
    description: "The number of consul raft peers should be 3, in order to preserve quorum.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: critical
  annotations:
    summary: Etcd insufficient Members (instance {{ $labels.instance }})
    description: "Etcd cluster should have an odd number of members\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
  labels:
    severity: critical
  annotations:
    summary: Etcd no Leader (instance {{ $labels.instance }})
    description: "Etcd cluster has no leader\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
- alert: EtcdHighNumberOfLeaderChanges
  expr: increase(etcd_server_leader_changes_seen_total[10m]) > 2
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Etcd high number of leader changes (instance {{ $labels.instance }})
    description: "Etcd leader changed more than 2 times during 10 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: EtcdHighNumberOfFailedGrpcRequests
  expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.01
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }})
    description: "More than 1% GRPC request failure detected in Etcd\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: EtcdHighNumberOfFailedGrpcRequests
  expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[1m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[1m])) BY (grpc_service, grpc_method) > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Etcd high number of failed GRPC requests (instance {{ $labels.instance }})
    description: "More than 5% GRPC request failure detected in Etcd\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: EtcdHighNumberOfFailedHttpRequests
  expr: sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.01
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }})
    description: "More than 1% HTTP failure detected in Etcd\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: EtcdHighNumberOfFailedHttpRequests
  expr: sum(rate(etcd_http_failed_total[1m])) BY (method) / sum(rate(etcd_http_received_total[1m])) BY (method) > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Etcd high number of failed HTTP requests (instance {{ $labels.instance }})
    description: "More than 5% HTTP failure detected in Etcd\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: EtcdMemberCommunicationSlow
  expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[1m])) > 0.15
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Etcd member communication slow (instance {{ $labels.instance }})
    description: "Etcd member communication slowing down, 99th percentile is over 0.15s\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: EtcdHighNumberOfFailedProposals
  expr: increase(etcd_server_proposals_failed_total[1h]) > 5
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Etcd high number of failed proposals (instance {{ $labels.instance }})
    description: "Etcd server got more than 5 failed proposals past hour\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: IstioPilotDuplicateEntry
  expr: sum(rate(pilot_duplicate_envoy_clusters{}[5m])) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Istio Pilot Duplicate Entry (instance {{ $labels.instance }})
    description: "Istio pilot duplicate entry error.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ArgocdServiceNotSynced
  expr: argocd_app_info{sync_status!="Synced"} != 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: ArgoCD service not synced (instance {{ $labels.instance }})
    description: "Service {{ $labels.name }} run by argo is currently not in sync.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ArgocdServiceUnhealthy
  expr: argocd_app_info{health_status!="Healthy"} != 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: ArgoCD service unhealthy (instance {{ $labels.instance }})
    description: "Service {{ $labels.name }} run by argo is currently not healthy.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
severity: warning
annotations:
  summary: Ceph monitor low space (instance {{ $labels.instance }})
  description: "Ceph monitor storage is low.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

severity: critical
annotations:
  summary: Ceph OSD Down (instance {{ $labels.instance }})
  description: "Ceph Object Storage Daemon Down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Ceph Object Storage Daemon latency is high. Please check whether it is stuck in an abnormal state.
severity: warning
annotations:
  summary: Ceph high OSD latency (instance {{ $labels.instance }})
  description: "Ceph Object Storage Daemon latency is high. Please check whether it is stuck in an abnormal state.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

severity: warning
annotations:
  summary: Ceph OSD low space (instance {{ $labels.instance }})
  description: "Ceph Object Storage Daemon is running out of space. Please add more disks.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

severity: warning
annotations:
  summary: Ceph OSD reweighted (instance {{ $labels.instance }})
  description: "Ceph Object Storage Daemon takes too much time to resize.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
severity: critical
annotations:
  summary: Ceph PG down (instance {{ $labels.instance }})
  description: "Some Ceph placement groups are down. Please ensure that all the data are available.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Some Ceph placement groups are incomplete. Please ensure that all the data are available.
severity: critical
annotations:
  summary: Ceph PG incomplete (instance {{ $labels.instance }})
  description: "Some Ceph placement groups are incomplete. Please ensure that all the data are available.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.
severity: warning
annotations:
  summary: Ceph PG inconsistent (instance {{ $labels.instance }})
  description: "Some Ceph placement groups are inconsistent. Data is available but inconsistent across nodes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

severity: warning
annotations:
  summary: Ceph PG activation long (instance {{ $labels.instance }})
  description: "Some Ceph placement groups are taking too long to activate.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Some Ceph placement groups are located on a full Object Storage Daemon in the cluster. These PGs may become unavailable shortly. Please check the OSDs, change weights, or reconfigure the CRUSH rules.
severity: warning
annotations:
  summary: Ceph PG backfill full (instance {{ $labels.instance }})
  description: "Some Ceph placement groups are located on a full Object Storage Daemon in the cluster. These PGs may become unavailable shortly. Please check the OSDs, change weights, or reconfigure the CRUSH rules.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

severity: critical
annotations:
  summary: Ceph PG unavailable (instance {{ $labels.instance }})
  description: "Some Ceph placement groups are unavailable.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
- alert: SpeedtestSlowInternetDownload
  expr: avg_over_time(speedtest_download[10m]) < 100
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: SpeedTest Slow Internet Download (instance {{ $labels.instance }})
    description: "Internet download speed is currently {{ humanize $value }} Mbps.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

severity: critical
annotations:
  summary: ZFS offline pool (instance {{ $labels.instance }})
  description: "A ZFS zpool is in an unexpected state: {{ $labels.state }}.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
- alert: ZfsPoolOutOfSpace
  expr: zfs_pool_free_bytes * 100 / zfs_pool_size_bytes < 10 and ON (instance, device, mountpoint) zfs_pool_readonly == 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: ZFS pool out of space (instance {{ $labels.instance }})
    description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

severity: critical
annotations:
  summary: ZFS pool unhealthy (instance {{ $labels.instance }})
  description: "ZFS pool state is {{ $value }}. See comments for more information.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
severity: warning
annotations:
  summary: ZFS collector failed (instance {{ $labels.instance }})
  description: "ZFS collector for {{ $labels.instance }} has failed to collect information\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
- alert: OpenebsUsedPoolCapacity
  expr: openebs_used_pool_capacity_percent > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: OpenEBS used pool capacity (instance {{ $labels.instance }})
    description: "OpenEBS pool uses more than 80% of its capacity\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

severity: critical
annotations:
  summary: Minio cluster disk offline (instance {{ $labels.instance }})
  description: "Minio cluster disk is offline\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

severity: critical
annotations:
  summary: Minio node disk offline (instance {{ $labels.instance }})
  description: "Minio cluster node disk is offline\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
- alert: MinioDiskSpaceUsage
  expr: disk_storage_available / disk_storage_total * 100 < 10
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Minio disk space usage (instance {{ $labels.instance }})
    description: "Minio available free space is low (< 10%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}% of compactions.
- alert: ThanosCompactorHighCompactionFailures
  expr: (sum by (job) (rate(thanos_compact_group_compactions_failures_total{job=~".*thanos-compact.*"}[5m])) / sum by (job) (rate(thanos_compact_group_compactions_total{job=~".*thanos-compact.*"}[5m])) * 100 > 5)
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Thanos Compactor High Compaction Failures (instance {{ $labels.instance }})
    description: "Thanos Compact {{$labels.job}} is failing to execute {{$value | humanize}}% of compactions.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Thanos Compact {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.
- alert: ThanosCompactBucketHighOperationFailures
  expr: (sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-compact.*"}[5m])) / sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-compact.*"}[5m])) * 100 > 5)
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Thanos Compact Bucket High Operation Failures (instance {{ $labels.instance }})
    description: "Thanos Compact {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ThanosCompactHasNotRun
  expr: (time() - max by (job) (max_over_time(thanos_objstore_bucket_last_successful_upload_time{job=~".*thanos-compact.*"}[24h]))) / 60 / 60 > 24
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Thanos Compact Has Not Run (instance {{ $labels.instance }})
    description: "Thanos Compact {{$labels.job}} has not uploaded anything for 24 hours.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of "query" requests.
- alert: ThanosQueryHttpRequestQueryErrorRateHigh
  expr: (sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query"}[5m])) / sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query"}[5m]))) * 100 > 5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Query Http Request Query Error Rate High (instance {{ $labels.instance }})
    description: "Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of \"query\" requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of "query_range" requests.
- alert: ThanosQueryHttpRequestQueryRangeErrorRateHigh
  expr: (sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-query.*", handler="query_range"}[5m])) / sum by (job) (rate(http_requests_total{job=~".*thanos-query.*", handler="query_range"}[5m]))) * 100 > 5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Query Http Request Query Range Error Rate High (instance {{ $labels.instance }})
    description: "Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of \"query_range\" requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ThanosQueryGrpcServerErrorRate
  expr: (sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-query.*"}[5m])) / sum by (job) (rate(grpc_server_started_total{job=~".*thanos-query.*"}[5m])) * 100 > 5)
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Thanos Query Grpc Server Error Rate (instance {{ $labels.instance }})
    description: "Thanos Query {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ThanosQueryGrpcClientErrorRate
  expr: (sum by (job) (rate(grpc_client_handled_total{grpc_code!="OK", job=~".*thanos-query.*"}[5m])) / sum by (job) (rate(grpc_client_started_total{job=~".*thanos-query.*"}[5m]))) * 100 > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Thanos Query Grpc Client Error Rate (instance {{ $labels.instance }})
    description: "Thanos Query {{$labels.job}} is failing to send {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Query {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for store endpoints.
- alert: ThanosQueryHighDNSFailures
  expr: (sum by (job) (rate(thanos_query_store_apis_dns_failures_total{job=~".*thanos-query.*"}[5m])) / sum by (job) (rate(thanos_query_store_apis_dns_lookups_total{job=~".*thanos-query.*"}[5m]))) * 100 > 1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Thanos Query High DNS Failures (instance {{ $labels.instance }})
    description: "Thanos Query {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for store endpoints.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for instant queries.
- alert: ThanosQueryInstantLatencyHigh
  expr: (histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query"}[5m]))) > 40 and sum by (job) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query"}[5m])) > 0)
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: Thanos Query Instant Latency High (instance {{ $labels.instance }})
    description: "Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for instant queries.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for range queries.
- alert: ThanosQueryRangeLatencyHigh
  expr: (histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-query.*", handler="query_range"}[5m]))) > 90 and sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-query.*", handler="query_range"}[5m])) > 0)
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: Thanos Query Range Latency High (instance {{ $labels.instance }})
    description: "Thanos Query {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for range queries.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Query {{$labels.job}} has been overloaded for more than 15 minutes. This may be a symptom of excessive simultaneous complex requests, low performance of the Prometheus API, or failures within these components. Assess the health of the Thanos Query instances and the connected Prometheus instances, look for potential senders of these requests, and then contact support.
- alert: ThanosReceiveHttpRequestErrorRateHigh
  expr: (sum by (job) (rate(http_requests_total{code=~"5..", job=~".*thanos-receive.*", handler="receive"}[5m])) / sum by (job) (rate(http_requests_total{job=~".*thanos-receive.*", handler="receive"}[5m]))) * 100 > 5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Receive Http Request Error Rate High (instance {{ $labels.instance }})
    description: "Thanos Receive {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Thanos Receive {{$labels.job}} has a 99th percentile latency of {{ $value }} seconds for requests.
- alert: ThanosReceiveHttpRequestLatencyHigh
  expr: (histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket{job=~".*thanos-receive.*", handler="receive"}[5m]))) > 10 and sum by (job) (rate(http_request_duration_seconds_count{job=~".*thanos-receive.*", handler="receive"}[5m])) > 0)
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: Thanos Receive Http Request Latency High (instance {{ $labels.instance }})
    description: "Thanos Receive {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Thanos Receive {{$labels.job}} is failing to replicate {{$value | humanize}}% of requests.
- alert: ThanosReceiveHighReplicationFailures
  expr: thanos_receive_replication_factor > 1 and ((sum by (job) (rate(thanos_receive_replications_total{result="error", job=~".*thanos-receive.*"}[5m])) / sum by (job) (rate(thanos_receive_replications_total{job=~".*thanos-receive.*"}[5m]))) > (max by (job) (floor((thanos_receive_replication_factor{job=~".*thanos-receive.*"}+1) / 2)) / max by (job) (thanos_receive_hashring_nodes{job=~".*thanos-receive.*"}))) * 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Thanos Receive High Replication Failures (instance {{ $labels.instance }})
    description: "Thanos Receive {{$labels.job}} is failing to replicate {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ThanosReceiveHighForwardRequestFailures
  expr: (sum by (job) (rate(thanos_receive_forward_requests_total{result="error", job=~".*thanos-receive.*"}[5m])) / sum by (job) (rate(thanos_receive_forward_requests_total{job=~".*thanos-receive.*"}[5m]))) * 100 > 20
  for: 5m
  labels:
    severity: info
  annotations:
    summary: Thanos Receive High Forward Request Failures (instance {{ $labels.instance }})
    description: "Thanos Receive {{$labels.job}} is failing to forward {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Thanos Receive {{$labels.job}} is failing to refresh hashring file, {{$value | humanize}} of attempts failed.
- alert: ThanosReceiveHighHashringFileRefreshFailures
  expr: (sum by (job) (rate(thanos_receive_hashrings_file_errors_total{job=~".*thanos-receive.*"}[5m])) / sum by (job) (rate(thanos_receive_hashrings_file_refreshes_total{job=~".*thanos-receive.*"}[5m])) > 0)
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Thanos Receive High Hashring File Refresh Failures (instance {{ $labels.instance }})
    description: "Thanos Receive {{$labels.job}} is failing to refresh hashring file, {{$value | humanize}} of attempts failed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ThanosSidecarNoConnectionToStartedPrometheus
  expr: thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0 and on (namespace, pod) prometheus_tsdb_data_replay_duration_seconds != 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Sidecar No Connection To Started Prometheus (instance {{ $labels.instance }})
    description: "Thanos Sidecar {{$labels.instance}} is unhealthy.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ThanosStoreGrpcErrorRate
  expr: (sum by (job) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-store.*"}[5m])) / sum by (job) (rate(grpc_server_started_total{job=~".*thanos-store.*"}[5m])) * 100 > 5)
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Thanos Store Grpc Error Rate (instance {{ $labels.instance }})
    description: "Thanos Store {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for store series gate requests.
- alert: ThanosStoreSeriesGateLatencyHigh
  expr: (histogram_quantile(0.99, sum by (job, le) (rate(thanos_bucket_store_series_gate_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2 and sum by (job) (rate(thanos_bucket_store_series_gate_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0)
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Thanos Store Series Gate Latency High (instance {{ $labels.instance }})
    description: "Thanos Store {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for store series gate requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Thanos Store {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.
- alert: ThanosStoreBucketHighOperationFailures
  expr: (sum by (job) (rate(thanos_objstore_bucket_operation_failures_total{job=~".*thanos-store.*"}[5m])) / sum by (job) (rate(thanos_objstore_bucket_operations_total{job=~".*thanos-store.*"}[5m])) * 100 > 5)
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Thanos Store Bucket High Operation Failures (instance {{ $labels.instance }})
    description: "Thanos Store {{$labels.job}} Bucket is failing to execute {{$value | humanize}}% of operations.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Thanos Store {{$labels.job}} Bucket has a 99th percentile latency of {{$value}} seconds for the bucket operations.
- alert: ThanosStoreObjstoreOperationLatencyHigh
  expr: (histogram_quantile(0.99, sum by (job, le) (rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))) > 2 and sum by (job) (rate(thanos_objstore_bucket_operation_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0)
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Thanos Store Objstore Operation Latency High (instance {{ $labels.instance }})
    description: "Thanos Store {{$labels.job}} Bucket has a 99th percentile latency of {{$value}} seconds for the bucket operations.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
- alert: ThanosRuleQueueIsDroppingAlerts
  expr: sum by (job, instance) (rate(thanos_alert_queue_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Rule Queue Is Dropping Alerts (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.instance}} is failing to queue alerts.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ThanosRuleSenderIsFailingAlerts
  expr: sum by (job, instance) (rate(thanos_alert_sender_alerts_dropped_total{job=~".*thanos-rule.*"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Rule Sender Is Failing Alerts (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.instance}} is failing to send alerts to alertmanager.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ThanosRuleHighRuleEvaluationFailures
  expr: (sum by (job, instance) (rate(prometheus_rule_evaluation_failures_total{job=~".*thanos-rule.*"}[5m])) / sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m])) * 100 > 5)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Rule High Rule Evaluation Failures (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.instance}} is failing to evaluate rules.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ThanosRuleHighRuleEvaluationWarnings
  expr: sum by (job, instance) (rate(thanos_rule_evaluation_with_warnings_total{job=~".*thanos-rule.*"}[5m])) > 0
  for: 15m
  labels:
    severity: info
  annotations:
    summary: Thanos Rule High Rule Evaluation Warnings (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.instance}} has high number of evaluation warnings.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Thanos Rule {{$labels.instance}} has higher evaluation latency than interval for {{$labels.rule_group}}.
- alert: ThanosRuleRuleEvaluationLatencyHigh
  expr: (sum by (job, instance, rule_group) (prometheus_rule_group_last_duration_seconds{job=~".*thanos-rule.*"}) > sum by (job, instance, rule_group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"}))
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Thanos Rule Rule Evaluation Latency High (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.instance}} has higher evaluation latency than interval for {{$labels.rule_group}}.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ThanosRuleGrpcErrorRate
  expr: (sum by (job, instance) (rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded", job=~".*thanos-rule.*"}[5m])) / sum by (job, instance) (rate(grpc_server_started_total{job=~".*thanos-rule.*"}[5m])) * 100 > 5)
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Thanos Rule Grpc Error Rate (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.job}} is failing to handle {{$value | humanize}}% of requests.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for query endpoints.
- alert: ThanosRuleQueryHighDNSFailures
  expr: (sum by (job, instance) (rate(thanos_rule_query_apis_dns_failures_total{job=~".*thanos-rule.*"}[5m])) / sum by (job, instance) (rate(thanos_rule_query_apis_dns_lookups_total{job=~".*thanos-rule.*"}[5m])) * 100 > 1)
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Thanos Rule Query High DNS Failures (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.job}} has {{$value | humanize}}% of failing DNS queries for query endpoints.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints.
- alert: ThanosRuleAlertmanagerHighDNSFailures
  expr: (sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_failures_total{job=~".*thanos-rule.*"}[5m])) / sum by (job, instance) (rate(thanos_rule_alertmanagers_dns_lookups_total{job=~".*thanos-rule.*"}[5m])) * 100 > 1)
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Thanos Rule Alertmanager High DNS Failures (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.instance}} has {{$value | humanize}}% of failing DNS queries for Alertmanager endpoints.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Thanos Rule {{$labels.job}} has rule groups that did not evaluate for at least 10x of their expected interval.
- alert: ThanosRuleNoEvaluationFor10Intervals
  expr: time() - max by (job, instance, group) (prometheus_rule_group_last_evaluation_timestamp_seconds{job=~".*thanos-rule.*"}) > 10 * max by (job, instance, group) (prometheus_rule_group_interval_seconds{job=~".*thanos-rule.*"})
  for: 5m
  labels:
    severity: info
  annotations:
    summary: Thanos Rule No Evaluation For 10 Intervals (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.job}} has rule groups that did not evaluate for at least 10x of their expected interval.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Thanos Rule {{$labels.instance}} did not perform any rule evaluations in the past 10 minutes.
- alert: ThanosNoRuleEvaluations
  expr: sum by (job, instance) (rate(prometheus_rule_evaluations_total{job=~".*thanos-rule.*"}[5m])) <= 0 and sum by (job, instance) (thanos_rule_loaded_rules{job=~".*thanos-rule.*"}) > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos No Rule Evaluations (instance {{ $labels.instance }})
    description: "Thanos Rule {{$labels.instance}} did not perform any rule evaluations in the past 10 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ThanosBucketReplicateErrorRate
  expr: (sum by (job) (rate(thanos_replicate_replication_runs_total{result="error", job=~".*thanos-bucket-replicate.*"}[5m])) / on (job) group_left sum by (job) (rate(thanos_replicate_replication_runs_total{job=~".*thanos-bucket-replicate.*"}[5m]))) * 100 >= 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Bucket Replicate Error Rate (instance {{ $labels.instance }})
    description: "Thanos Replicate is failing to run, {{$value | humanize}}% of attempts failed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for the replicate operations.
- alert: ThanosBucketReplicateRunLatency
  expr: (histogram_quantile(0.99, sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~".*thanos-bucket-replicate.*"}[5m]))) > 20 and sum by (job) (rate(thanos_replicate_replication_run_duration_seconds_bucket{job=~".*thanos-bucket-replicate.*"}[5m])) > 0)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Bucket Replicate Run Latency (instance {{ $labels.instance }})
    description: "Thanos Replicate {{$labels.job}} has a 99th percentile latency of {{$value}} seconds for the replicate operations.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ThanosCompactIsDown
  expr: absent(up{job=~".*thanos-compact.*"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Compact Is Down (instance {{ $labels.instance }})
    description: "ThanosCompact has disappeared. Prometheus target for the component cannot be discovered.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
severity: critical
annotations:
  summary: Thanos Query Is Down (instance {{ $labels.instance }})
  description: "ThanosQuery has disappeared. Prometheus target for the component cannot be discovered.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ThanosReceiveIsDown
  expr: absent(up{job=~".*thanos-receive.*"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Receive Is Down (instance {{ $labels.instance }})
    description: "ThanosReceive has disappeared. Prometheus target for the component cannot be discovered.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
severity: critical
annotations:
  summary: Thanos Rule Is Down (instance {{ $labels.instance }})
  description: "ThanosRule has disappeared. Prometheus target for the component cannot be discovered.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

- alert: ThanosSidecarIsDown
  expr: absent(up{job=~".*thanos-sidecar.*"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Thanos Sidecar Is Down (instance {{ $labels.instance }})
    description: "ThanosSidecar has disappeared. Prometheus target for the component cannot be discovered.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
severity: critical
annotations:
  summary: Thanos Store Is Down (instance {{ $labels.instance }})
  description: "ThanosStore has disappeared. Prometheus target for the component cannot be discovered.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
- alert: LokiProcessTooManyRestarts
  expr: changes(process_start_time_seconds{job=~".*loki.*"}[15m]) > 2
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Loki process too many restarts (instance {{ $labels.instance }})
    description: "A loki process had too many restarts (target {{$labels.instance}})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})
- alert: JenkinsBuildsHealthScore
  expr: default_jenkins_builds_health_score < 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Jenkins builds health score (instance {{ $labels.instance }})
    description: "Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Job run failures: ({{$value}}) {{$labels.jenkins_job}}. Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})
- alert: JenkinsRunFailureTotal
  expr: delta(jenkins_runs_failure_total[1h]) > 100
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Jenkins run failure total (instance {{ $labels.instance }})
    description: "Job run failures: ({{$value}}) {{$labels.jenkins_job}}. Healthcheck failure for `{{$labels.instance}}` in realm {{$labels.realm}}/{{$labels.env}} ({{$labels.region}})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Last build tests failed: {{$labels.jenkins_job}}. Failed build Tests for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}})
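The rule body for this entry is not shown. A minimal sketch, assuming the Jenkins Prometheus plugin exposes default_jenkins_builds_last_build_tests_failing alongside the default_jenkins_builds_* metrics used by the neighbouring rules (an assumption), could be:

- alert: JenkinsLastBuildTestsFailed
  expr: default_jenkins_builds_last_build_tests_failing > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Jenkins last build tests failed (instance {{ $labels.instance }})
    description: "Last build tests failed: {{$labels.jenkins_job}}. Failed build Tests for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"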
Last build failed: {{$labels.jenkins_job}}. Failed build for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}})
# * RUNNING   -1 true  - The build had no errors.
# * SUCCESS    0 true  - The build had no errors.
# * UNSTABLE   1 true  - The build had some errors but they were not fatal. For example, some tests failed.
# * FAILURE    2 false - The build had a fatal error.
# * NOT_BUILT  3 false - The module was not built.
# * ABORTED    4 false - The build was manually aborted.
- alert: JenkinsLastBuildFailed
  expr: default_jenkins_builds_last_build_result_ordinal == 2
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Jenkins last build failed (instance {{ $labels.instance }})
    description: "Last build failed: {{$labels.jenkins_job}}. Failed build for job `{{$labels.jenkins_job}}` on {{$labels.instance}}/{{$labels.env}} ({{$labels.region}})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
- alert: ApcUpsLessThan15MinutesOfBatteryTimeRemaining
  expr: apcupsd_battery_time_left_seconds < 900
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: APC UPS Less than 15 Minutes of battery time remaining (instance {{ $labels.instance }})
    description: "Battery is almost empty (< 15 Minutes remaining)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

severity: warning
annotations:
  summary: APC UPS AC input outage (instance {{ $labels.instance }})
  description: "UPS now running on battery (since {{ $value | humanizeDuration }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
- alert: ApcUpsHighTemperature
  expr: apcupsd_internal_temperature_celsius >= 40
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: APC UPS high temperature (instance {{ $labels.instance }})
    description: "Internal temperature is high ({{ $value }}°C)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

severity: warning
annotations:
  summary: APC UPS high load (instance {{ $labels.instance }})
  description: "UPS load is > 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Failed net_version for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`
severity: critical
annotations:
  summary: Provider failed because net_version failed (instance {{ $labels.instance }})
  description: "Failed net_version for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Failed to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`
severity: critical
annotations:
  summary: Provider failed because get genesis failed (instance {{ $labels.instance }})
  description: "Failed to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
net_version timeout for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`
severity: critical
annotations:
  summary: Provider failed because net_version timeout (instance {{ $labels.instance }})
  description: "net_version timeout for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Timeout to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`
severity: critical
annotations:
  summary: Provider failed because get genesis timeout (instance {{ $labels.instance }})
  description: "Timeout to get genesis for Provider `{{$labels.provider}}` in Graph node `{{$labels.instance}}`\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`
severity: warning
annotations:
  summary: Store connection is too slow (instance {{ $labels.instance }})
  description: "Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`
severity: critical
annotations:
  summary: Store connection is too slow (instance {{ $labels.instance }})
  description: "Store connection is too slow to `{{$labels.pool}}` pool, `{{$labels.shard}}` shard in Graph node `{{$labels.instance}}`\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"