
Alerting: Having dot in label name prevents alert being delivered to prometheus alertmanager #77872

schoentoon opened this issue Nov 8, 2023 · 10 comments

What happened?

We initially ran into this with an elasticsearch alert grouped by a field called host.keyword. host.keyword became a label on the grafana alert, and the alert never showed up in prometheus alertmanager (nor did any other alerts at that point). After inspecting the logs I ran into the following:

{"count":1,"level":"info","logger":"ngalert.sender.router","msg":"Sending alerts to local notifier","org_id":1,"rule_uid":"e6cc3237-ecbd-471e-b179-aedcee6203bf","t":"2023-11-08T14:39:01.660894667Z"}
{"Body":"{\"status\":\"error\",\"errorType\":\"bad_data\",\"error\":\"\\\"test.test\\\" is not a valid label name\"}","level":"warn","logger":"ngalert.notifier.prometheus-alertmanager","msg":"HTTP request failed","notifierUID":"notifier1","statusCode":"400 Bad Request","t":"2023-11-08T14:39:31.666675515Z","url":"http://alertmanager-0.alertmanager-discovery.monitoring.svc:9093/api/v1/alerts"}
{"alertmanager":"alertmanager","error":"failed to send HTTP request - status code 400","level":"warn","logger":"ngalert.notifier.prometheus-alertmanager","msg":"failed to send to Alertmanager","notifierUID":"notifier1","t":"2023-11-08T14:39:31.667193405Z","url":"http://alertmanager-0.alertmanager-discovery.monitoring.svc:9093/api/v1/alerts"}
{"Body":"{\"status\":\"error\",\"errorType\":\"bad_data\",\"error\":\"\\\"test.test\\\" is not a valid label name\"}","level":"warn","logger":"ngalert.notifier.prometheus-alertmanager","msg":"HTTP request failed","notifierUID":"notifier1","statusCode":"400 Bad Request","t":"2023-11-08T14:39:31.670141407Z","url":"http://alertmanager-1.alertmanager-discovery.monitoring.svc:9093/api/v1/alerts"}
{"alertmanager":"alertmanager","error":"failed to send HTTP request - status code 400","level":"warn","logger":"ngalert.notifier.prometheus-alertmanager","msg":"failed to send to Alertmanager","notifierUID":"notifier1","t":"2023-11-08T14:39:31.670231448Z","url":"http://alertmanager-1.alertmanager-discovery.monitoring.svc:9093/api/v1/alerts"}
{"alertmanager":"alertmanager","level":"warn","logger":"ngalert.notifier.prometheus-alertmanager","msg":"all attempts to send to Alertmanager failed","notifierUID":"notifier1","t":"2023-11-08T14:39:31.670262518Z"}
{"1":"(MISSING)","component":"dispatcher","err":"alertmanager/prometheus-alertmanager[0]: notify retry canceled due to unrecoverable error after 1 attempts: failed to send alert to Alertmanager: failed to send HTTP request - status code 400","level":"error","logger":"ngalert.notifier.alertmanager","msg":"Notify for alerts failed","num_alerts":1,"orgID":1,"t":"2023-11-08T14:39:31.670380308Z"}

(ok ok, I cheated here: the log is from my test setup, hence the test.test label)

Afterwards I managed to reproduce it by manually adding a similar label to any alert (I tested it with a prometheus alert).
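
For context, the error in the logs is the classic Prometheus/Alertmanager label-name rule: label names must match [a-zA-Z_][a-zA-Z0-9_]*, so anything containing a dot is rejected. A minimal sketch of that rule in Go (just an illustration, not Grafana or Alertmanager code):

package main

import (
	"fmt"
	"regexp"
)

// Classic Prometheus label-name rule: ASCII letters, digits and underscores,
// with a non-digit first character.
var labelNameRE = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

func main() {
	for _, name := range []string{"host.keyword", "test.test", "host_keyword"} {
		fmt.Printf("%q valid: %v\n", name, labelNameRE.MatchString(name))
	}
	// Output:
	// "host.keyword" valid: false
	// "test.test" valid: false
	// "host_keyword" valid: true
}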

What did you expect to happen?

I expected the alert to show up in alertmanager.

Did this work before?

We are still in the process of upgrading from grafana 8.5 to 10+ and unified alerting. On legacy alerting we didn't get this label, as far as I remember.

How do we reproduce it?

  • Create a new alert that will be routed to prometheus alertmanager through the notification policies
  • Add a label like test.test to the alert
  • Trigger said alert (a direct reproduction against the Alertmanager API is sketched right after these steps)
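
You can also reproduce the rejection without Grafana by posting an alert with a dotted label name straight to the v1 alerts endpoint shown in the logs above. A rough sketch in Go (the URL is a placeholder for your own Alertmanager, and it assumes a version that still serves /api/v1/alerts):

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// One alert whose label name contains a dot.
	alerts := []map[string]any{{
		"labels":   map[string]string{"alertname": "DotLabelRepro", "test.test": "value"},
		"startsAt": time.Now().UTC().Format(time.RFC3339),
	}}
	body, err := json.Marshal(alerts)
	if err != nil {
		panic(err)
	}

	// Placeholder URL; point this at your own Alertmanager.
	resp, err := http.Post("http://localhost:9093/api/v1/alerts", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)  // expected: 400 Bad Request (as in the logs above)
	fmt.Println(string(out))  // expected: "test.test" is not a valid label name
}
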
    Is the bug inside a dashboard panel?

    No response

    Environment (with versions)?

    Grafana: 10.2.0
    OS: Kubernetes
    Browser: Firefox

    Grafana platform?

    Kubernetes

    Datasource(s)?

    No response

    The triage/needs-confirmation label (used for OSS triage rotation - reported issue needs to be reproduced) was added Nov 24, 2023.

    Still able to reproduce this on 10.2.2:

    {"Body":"{\"status\":\"error\",\"errorType\":\"bad_data\",\"error\":\"\\\"test.test\\\" is not a valid label name\"}","level":"warn","logger":"ngalert.notifier.prometheus-alertmanager","msg":"HTTP request failed","notifierUID":"notifier1","statusCode":"400 Bad Request","t":"2023-11-27T07:53:53.420369651Z","url":"http://alertmanager-0.alertmanager-discovery.monitoring.svc:9093/api/v1/alerts"}
    {"alertmanager":"alertmanager","error":"failed to send HTTP request - status code 400","level":"warn","logger":"ngalert.notifier.prometheus-alertmanager","msg":"failed to send to Alertmanager","notifierUID":"notifier1","t":"2023-11-27T07:53:53.420751441Z","url":"http://alertmanager-0.alertmanager-discovery.monitoring.svc:9093/api/v1/alerts"}
              

    I'm also getting the same issue.
    Attaching my logs below as well, using grafana version 9.3:

    "grafana logger=alerting.notifier.prometheus-alertmanager t=2023-11-28T00:27:56.118297071Z level=warn msg="HTTP request failed" url=https://host/vmalertmanager/api/v1/alerts statusCode="400 Bad Request" body="{"status":"error","errorType":"bad_data","error":"\"data.response.status\" is not a valid label name"}"

    We've always used prometheus alertmanager to have one central location to send alerting emails from, as we have several grafana instances and they're all quite restricted using k8s network policies. It seemed easiest during our migration to just keep this as is. As for the migration of alerts, we've really just manually recreated all of them one-by-one, and in some cases we've adjusted them while we were at it.

    Originally this specific alert would have told us if there was a certain threshold of logs on any node. While recreating this alert on grafana 10.x.x we figured we should preserve what node it was on and display it in the alert, which is how we eventually ran into this issue. That being said, as you've mentioned in the comment above, I'll attempt with a newer version of alertmanager (with the pull request you linked included) tomorrow to see whether that would resolve this.

    alertmanager main didn't help. I did however make another discovery: if I configure the prometheus alertmanager as a datasource and configure it as an external alertmanager, it works exactly as I would expect. I end up seeing the following in the logs:
    {"cfg":1,"level":"warn","logger":"ngalert.sender.external-alertmanager","msg":"Alert sending to external Alertmanager(s) contains label/annotation name with invalid characters","name":"host.keyword","org":1,"t":"2023-12-01T12:34:03.015000632Z"}
    But the dot in host.keyword is properly replaced with a _ before being sent to prometheus alertmanager (it shows up as host_keyword in alertmanager). Meaning that this is only bugged when the alertmanager is added as a contact point, not as an external alertmanager.
    I think I would still prefer to go with the contact point option, as it's possible to configure multiple urls for the same alertmanager there, unlike with the external alertmanager (at least, it seems to be like that).

    Well, as I found out, the underscore replacement only happens with the external alertmanager (which I'm fine with, honestly), not with alertmanager configured as a contact point. ngalert.sender.external-alertmanager applies the underscores; ngalert.notifier.prometheus-alertmanager does not, and because of that it never manages to deliver any alerts.
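
    Based on the log line in the previous comment, the external-alertmanager sender's behaviour seems to boil down to replacing every character that is not valid in a Prometheus label name with an underscore before sending. A rough sketch of that idea in Go (not Grafana's actual code, just an illustration of the behaviour described above):

    package main

    import "fmt"

    // sanitizeLabelName replaces every character that is not allowed in a
    // classic Prometheus label name ([a-zA-Z_][a-zA-Z0-9_]*) with '_'.
    // This is a rough illustration, not Grafana's actual implementation.
    func sanitizeLabelName(name string) string {
        runes := []rune(name)
        for i, r := range runes {
            isLetter := (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z')
            isDigit := r >= '0' && r <= '9'
            if r == '_' || isLetter || (isDigit && i > 0) {
                continue
            }
            runes[i] = '_'
        }
        return string(runes)
    }

    func main() {
        fmt.Println(sanitizeLabelName("host.keyword")) // host_keyword
        fmt.Println(sanitizeLabelName("test.test"))    // test_test
    }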