Alerting: A dot in a label name prevents alerts from being delivered to Prometheus Alertmanager
We initially ran into this with an Elasticsearch alert grouped by a field called host.keyword. The host.keyword field turns into a label on the Grafana alert, and the alert never shows up in Prometheus Alertmanager (nor do any other alerts at that point). After inspecting the logs I ran into the following:
{"count":1,"level":"info","logger":"ngalert.sender.router","msg":"Sending alerts to local notifier","org_id":1,"rule_uid":"e6cc3237-ecbd-471e-b179-aedcee6203bf","t":"2023-11-08T14:39:01.660894667Z"}
{"Body":"{\"status\":\"error\",\"errorType\":\"bad_data\",\"error\":\"\\\"test.test\\\" is not a valid label name\"}","level":"warn","logger":"ngalert.notifier.prometheus-alertmanager","msg":"HTTP request failed","notifierUID":"notifier1","statusCode":"400 Bad Request","t":"2023-11-08T14:39:31.666675515Z","url":"http://alertmanager-0.alertmanager-discovery.monitoring.svc:9093/api/v1/alerts"}
{"alertmanager":"alertmanager","error":"failed to send HTTP request - status code 400","level":"warn","logger":"ngalert.notifier.prometheus-alertmanager","msg":"failed to send to Alertmanager","notifierUID":"notifier1","t":"2023-11-08T14:39:31.667193405Z","url":"http://alertmanager-0.alertmanager-discovery.monitoring.svc:9093/api/v1/alerts"}
{"Body":"{\"status\":\"error\",\"errorType\":\"bad_data\",\"error\":\"\\\"test.test\\\" is not a valid label name\"}","level":"warn","logger":"ngalert.notifier.prometheus-alertmanager","msg":"HTTP request failed","notifierUID":"notifier1","statusCode":"400 Bad Request","t":"2023-11-08T14:39:31.670141407Z","url":"http://alertmanager-1.alertmanager-discovery.monitoring.svc:9093/api/v1/alerts"}
{"alertmanager":"alertmanager","error":"failed to send HTTP request - status code 400","level":"warn","logger":"ngalert.notifier.prometheus-alertmanager","msg":"failed to send to Alertmanager","notifierUID":"notifier1","t":"2023-11-08T14:39:31.670231448Z","url":"http://alertmanager-1.alertmanager-discovery.monitoring.svc:9093/api/v1/alerts"}
{"alertmanager":"alertmanager","level":"warn","logger":"ngalert.notifier.prometheus-alertmanager","msg":"all attempts to send to Alertmanager failed","notifierUID":"notifier1","t":"2023-11-08T14:39:31.670262518Z"}
{"1":"(MISSING)","component":"dispatcher","err":"alertmanager/prometheus-alertmanager[0]: notify retry canceled due to unrecoverable error after 1 attempts: failed to send alert to Alertmanager: failed to send HTTP request - status code 400","level":"error","logger":"ngalert.notifier.alertmanager","msg":"Notify for alerts failed","num_alerts":1,"orgID":1,"t":"2023-11-08T14:39:31.670380308Z"}
(OK, I cheated a bit here: the log is from my test setup, hence the test.test.)
Afterwards I managed to reproduce it by manually adding a similar label to any alert (I tested it with a Prometheus alert).
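For context, the 400 comes from Alertmanager's label-name validation. A minimal sketch of that rule in Go (my own illustration of the legacy Prometheus naming rules these versions use, not Grafana or Alertmanager code):

```go
package main

import (
	"fmt"
	"regexp"
)

// labelNameRE mirrors the legacy Prometheus/Alertmanager label-name pattern:
// a name must start with a letter or underscore and may only contain
// letters, digits, and underscores, so any name with a dot is rejected.
var labelNameRE = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

func main() {
	for _, name := range []string{"test.test", "host.keyword", "host_keyword"} {
		fmt.Printf("%-14s valid=%v\n", name, labelNameRE.MatchString(name))
	}
	// test.test      valid=false
	// host.keyword   valid=false
	// host_keyword   valid=true
}
```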
What did you expect to happen?
I expected the alert to show up in Alertmanager.
Did this work before?
We are still in the process of upgrading from Grafana 8.5 to 10+ and unified alerting. On legacy alerting we didn't get this label, as far as I remember.
How do we reproduce it?
1. Create a new alert that, through the notification policies, will be routed to Prometheus Alertmanager
2. Add a label like test.test to the alert
3. Trigger said alert (a quick way to confirm the rejection directly against Alertmanager is sketched below)
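To confirm the rejection comes from Alertmanager itself, independent of Grafana, you can post a test alert with a dotted label name straight to the alerts API. A rough sketch of that (the URL is an assumption based on the logs above, adjust it for your setup; the logs show the deprecated v1 path, but v2 also rejects the label with a 400):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Assumed Alertmanager URL, taken from the logs above; replace with your own.
	url := "http://alertmanager-0.alertmanager-discovery.monitoring.svc:9093/api/v2/alerts"

	// One firing alert whose label set contains a dotted label name.
	payload := []byte(`[{
		"labels": {"alertname": "DotLabelTest", "test.test": "1"},
		"annotations": {"summary": "label name with a dot"},
		"startsAt": "` + time.Now().UTC().Format(time.RFC3339) + `"
	}]`)

	resp, err := http.Post(url, "application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)  // expected: 400 Bad Request
	fmt.Println(string(body)) // expected: an error complaining about the invalid label name
}
```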
Is the bug inside a dashboard panel?
No response
Environment (with versions)?
Grafana: 10.2.0
OS: Kubernetes
Browser: Firefox
Grafana platform?
Kubernetes
Datasource(s)?
No response
Still able to reproduce this on 10.2.2:
{"Body":"{\"status\":\"error\",\"errorType\":\"bad_data\",\"error\":\"\\\"test.test\\\" is not a valid label name\"}","level":"warn","logger":"ngalert.notifier.prometheus-alertmanager","msg":"HTTP request failed","notifierUID":"notifier1","statusCode":"400 Bad Request","t":"2023-11-27T07:53:53.420369651Z","url":"http://alertmanager-0.alertmanager-discovery.monitoring.svc:9093/api/v1/alerts"}
{"alertmanager":"alertmanager","error":"failed to send HTTP request - status code 400","level":"warn","logger":"ngalert.notifier.prometheus-alertmanager","msg":"failed to send to Alertmanager","notifierUID":"notifier1","t":"2023-11-27T07:53:53.420751441Z","url":"http://alertmanager-0.alertmanager-discovery.monitoring.svc:9093/api/v1/alerts"}
I'm also getting the same issue. Attaching my logs below as well, using Grafana version 9.3:
"grafana logger=alerting.notifier.prometheus-alertmanager t=2023-11-28T00:27:56.118297071Z level=warn msg="HTTP request failed" url=https://host/vmalertmanager/api/v1/ │
│ alerts statusCode="400 Bad Request" body="{"status":"error","errorType":"bad_data","error":"\"data.response.status\" is not a valid label name"}"
We've always used Prometheus Alertmanager as one central location to send alerting emails from, since we have several Grafana instances and they're all quite restricted by k8s network policies. It seemed easiest during our migration to just keep this as is. As for the migration of the alerts, we've really just manually recreated all of them one by one, and in some cases we've adjusted them while we were at it. Originally this specific alert would have told us if there was a certain threshold of logs on any node. While recreating this alert on Grafana 10.x.x we figured we should preserve which node it was on and display it in the alert, which is how we eventually ran into this issue.

That being said, as you've mentioned in the comment above, I'll attempt this with a newer version of Alertmanager (with the pull request you linked included) tomorrow to see whether that resolves it.
Alertmanager main didn't help. I did, however, make another discovery: if I configure the Prometheus Alertmanager as a datasource and set it up as an external Alertmanager, it works exactly as I would expect. I end up seeing the following in the logs:
{"cfg":1,"level":"warn","logger":"ngalert.sender.external-alertmanager","msg":"Alert sending to external Alertmanager(s) contains label/annotation name with invalid characters","name":"host.keyword","org":1,"t":"2023-12-01T12:34:03.015000632Z"}
But the dot in host.keyword is properly getting replaced with an underscore before being sent to Prometheus Alertmanager (it shows up as host_keyword in Alertmanager). Meaning that this is only bugged when the Alertmanager is added as a contact point.
I think I would still prefer to go with the contact point option, as it's possible to configure multiple URLs for the same Alertmanager there, unlike with the external Alertmanager (at least, it seems to be that way).
Well, as I found out, the underscore replacement only happens with the external Alertmanager (which I'm fine with, honestly), not with Alertmanager configured as a contact point. ngalert.sender.external-alertmanager applies the underscores; ngalert.notifier.prometheus-alertmanager does not, and because of that it never manages to deliver any alerts.
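For anyone reading along, this is roughly the kind of sanitization the external-Alertmanager sender appears to apply and the Prometheus Alertmanager contact point skips (my own sketch, not Grafana's actual code):

```go
package main

import (
	"fmt"
	"regexp"
)

// invalidLabelCharRE matches any character that is not legal inside a
// (legacy) Prometheus label name.
var invalidLabelCharRE = regexp.MustCompile(`[^a-zA-Z0-9_]`)

// sanitizeLabelName replaces every illegal character with an underscore,
// e.g. "host.keyword" -> "host_keyword".
func sanitizeLabelName(name string) string {
	return invalidLabelCharRE.ReplaceAllString(name, "_")
}

func main() {
	fmt.Println(sanitizeLabelName("host.keyword"))         // host_keyword
	fmt.Println(sanitizeLabelName("data.response.status")) // data_response_status
}
```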