I have been seeing periodic dropping of spans, correlating with spikes in in-queue and save latency. Save latency in particular is an indication that the storage backend isn't able to keep up. The spikes also correlate with an increase in flush operation timing, and the index write memory steadily climbs until the flush, at which point it drops and we see the span drops and latency spikes. I don't have much experience with Elasticsearch, so any help/guidance is appreciated!
I have attached some screenshots of our Jaeger and Elasticsearch dashboards.
The issue was not Elasticsearch, but rather the way we were routing traffic from our OpenTelemetry collector fleet to our Jaeger collector fleet. We were routing through the Kubernetes ClusterIP Service at jaeger-collector.jaeger.svc.cluster.local:14250. We noticed the traces were not distributed evenly across the Jaeger collectors, so we swapped in an Application Load Balancer with a target group using the gRPC protocol version. The root cause for us was the way gRPC works: each OpenTelemetry collector keeps a long-lived HTTP/2 connection open, and the Kubernetes Service only balances at connection-establishment time, so all of a collector's spans kept landing on the same Jaeger collector pod. After the swap we are no longer dropping spans, translog ops and size are tiny, and the index write memory is steady and flat.
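For anyone hitting the same symptom, here is a minimal Go sketch of the gRPC behaviour behind it. It is illustrative only: the headless-service name is hypothetical, and this is not the ALB setup we actually deployed.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Dialing the ClusterIP Service: gRPC opens a single long-lived HTTP/2
	// connection to the one Service IP and multiplexes every span batch over
	// it, so a single Jaeger collector pod ends up receiving all the traffic.
	pinned, err := grpc.Dial(
		"jaeger-collector.jaeger.svc.cluster.local:14250",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer pinned.Close()

	// Alternative to an external load balancer: point gRPC at a headless
	// Service (hypothetical name below) so DNS returns every pod IP, and
	// enable client-side round_robin so requests spread across the pods.
	balanced, err := grpc.Dial(
		"dns:///jaeger-collector-headless.jaeger.svc.cluster.local:14250",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer balanced.Close()
}
```

We went with the ALB and a gRPC target group rather than client-side balancing, but either approach avoids pinning every span stream to a single collector pod.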