Hello,

I have installed Jaeger with an Elasticsearch storage backend in Kubernetes (AWS EKS). I have 8 r5.4xlarge nodes and Elasticsearch version 7.16.2 (docker.elastic.co/elasticsearch/elasticsearch:7.16.2). The deployment is done using the Elasticsearch Helm chart (the elasticsearch chart in elastic/helm-charts on GitHub). The values I have overridden are:

  elasticsearch:
    replicas: 4
    minimumMasterNodes: 3
    volumeClaimTemplate:
      accessModes: ['ReadWriteOnce']
      resources:
        requests:
          storage: 2000Gi
    resources:
      requests:
        cpu: '5000m'
        memory: '32Gi'
      limits:
        cpu: '8000m'
        memory: '32Gi'
    esJavaOpts: '-Xmx16g -Xms16g'

I have been seeing periodic span drops that correlate with spikes in in-queue latency and save latency. Save latency in particular is an indication that the storage backend isn't able to keep up. The spikes also correlate with an increase in flush operation timing, and the index writer memory climbs steadily until a flush, at which point it drops and we see the span drops and latency spikes. I don't have much experience with Elasticsearch, so any help/guidance is appreciated!
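
For anyone watching for the same symptoms, these are roughly the collector metrics I'm describing above. This is only a sketch of a Prometheus alerting rule built on them; the metric names are the ones our Jaeger collectors expose and may differ between Jaeger versions, and the rule names and thresholds are just placeholders:

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: jaeger-collector-alerts     # placeholder name
    namespace: jaeger
  spec:
    groups:
      - name: jaeger-collector
        rules:
          # Collector queue is overflowing and spans are being dropped
          - alert: JaegerCollectorDroppingSpans
            expr: sum(rate(jaeger_collector_spans_dropped_total[5m])) > 0
            for: 5m
          # p99 time to write a span to the storage backend is high
          - alert: JaegerCollectorSaveLatencyHigh
            expr: >
              histogram_quantile(0.99,
                sum(rate(jaeger_collector_save_latency_bucket[5m])) by (le)) > 0.5
            for: 10m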

I have attached some screenshots of our Jaeger and Elasticsearch dashboards.

The issue was not Elasticsearch, but rather the way we were routing traffic from our OpenTelemetry collector fleet to our Jaeger collector fleet. We were routing through the Kubernetes service jaeger-collector.jaeger.svc.cluster.local:14250 and noticed that traces were not distributed evenly across the Jaeger collectors. The root cause for us was the way gRPC works: it holds long-lived HTTP/2 connections open, and Kubernetes service routing balances per connection rather than per request, so each OpenTelemetry collector pinned all of its traffic to a single Jaeger collector. We swapped in an AWS Application Load Balancer with a target group using the gRPC protocol version (a rough sketch of the Ingress is below). After the swap we are no longer dropping spans, translog ops and size are tiny, and the index write memory is steady and flat.
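
For reference, this is roughly what that setup looks like with the AWS Load Balancer Controller. It is only a sketch, not our actual manifest: the hostname, certificate ARN, and resource names are placeholders, and the gRPC protocol version annotation requires an HTTPS listener on the ALB:

  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: jaeger-collector-grpc       # placeholder name
    namespace: jaeger
    annotations:
      alb.ingress.kubernetes.io/scheme: internal
      alb.ingress.kubernetes.io/target-type: ip
      # gRPC target groups need an HTTPS listener, so a certificate is required
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:REGION:ACCOUNT:certificate/PLACEHOLDER
      alb.ingress.kubernetes.io/backend-protocol-version: GRPC
  spec:
    ingressClassName: alb
    rules:
      - host: jaeger-collector.example.internal   # placeholder hostname
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: jaeger-collector
                  port:
                    number: 14250

The OpenTelemetry collectors then export to the load balancer's hostname instead of the cluster-internal service address, and the ALB spreads requests across the Jaeger collector pods.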