Hello,

I have installed Jaeger with an Elasticsearch storage backend in Kubernetes (AWS EKS). I have 8 r5.4xlarge nodes and Elasticsearch version 7.16.2 (docker.elastic.co/elasticsearch/elasticsearch:7.16.2). The deployment is done using the Elasticsearch Helm chart (the elasticsearch chart in elastic/helm-charts on GitHub). The values I have overridden are:

  elasticsearch:
    replicas: 4
    minimumMasterNodes: 3
    volumeClaimTemplate:
      accessModes: ['ReadWriteOnce']
      resources:
        requests:
          storage: 2000Gi
    resources:
      requests:
        cpu: '5000m'
        memory: '32Gi'
      limits:
        cpu: '8000m'
        memory: '32Gi'
    esJavaOpts: '-Xmx16g -Xms16g'

I have been seeing periodic span drops that correlate with spikes in in-queue latency and save latency. Save latency in particular is an indication that the storage backend isn't able to keep up. The spikes also correlate with an increase in flush operation timing, and the index writer memory climbs steadily until a flush, at which point it drops and we see the span drops and latency spikes. I don't have much experience with Elasticsearch, so any help/guidance is appreciated!
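
For anyone watching for the same symptoms, these are roughly the collector metrics I'm describing above. This is only a sketch of a Prometheus alerting rule built on them; the metric names are the ones our Jaeger collectors expose and may differ between Jaeger versions, and the rule names and thresholds are just placeholders:

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: jaeger-collector-alerts     # placeholder name
    namespace: jaeger
  spec:
    groups:
      - name: jaeger-collector
        rules:
          # Collector queue is overflowing and spans are being dropped
          - alert: JaegerCollectorDroppingSpans
            expr: sum(rate(jaeger_collector_spans_dropped_total[5m])) > 0
            for: 5m
          # p99 time to write a span to the storage backend is high
          - alert: JaegerCollectorSaveLatencyHigh
            expr: >
              histogram_quantile(0.99,
                sum(rate(jaeger_collector_save_latency_bucket[5m])) by (le)) > 0.5
            for: 10m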

I have attached some screenshots of our Jaeger and Elasticsearch dashboards.

The issue was not Elasticsearch, but rather the way we were routing traffic from our OpenTelemetry collector fleet to our Jaeger collector fleet. We were routing through the Kubernetes service jaeger-collector.jaeger.svc.cluster.local:14250 and noticed that traces were not distributed evenly across the Jaeger collectors. The root cause for us was the way gRPC works: it holds long-lived HTTP/2 connections open, and Kubernetes service routing balances per connection rather than per request, so each OpenTelemetry collector pinned all of its traffic to a single Jaeger collector. We swapped in an AWS Application Load Balancer with a target group using the gRPC protocol version (a rough sketch of the Ingress is below). After the swap we are no longer dropping spans, translog ops and size are tiny, and the index write memory is steady and flat.
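
For reference, this is roughly what that setup looks like with the AWS Load Balancer Controller. It is only a sketch, not our actual manifest: the hostname, certificate ARN, and resource names are placeholders, and the gRPC protocol version annotation requires an HTTPS listener on the ALB:

  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: jaeger-collector-grpc       # placeholder name
    namespace: jaeger
    annotations:
      alb.ingress.kubernetes.io/scheme: internal
      alb.ingress.kubernetes.io/target-type: ip
      # gRPC target groups need an HTTPS listener, so a certificate is required
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:REGION:ACCOUNT:certificate/PLACEHOLDER
      alb.ingress.kubernetes.io/backend-protocol-version: GRPC
  spec:
    ingressClassName: alb
    rules:
      - host: jaeger-collector.example.internal   # placeholder hostname
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: jaeger-collector
                  port:
                    number: 14250

The OpenTelemetry collectors then export to the load balancer's hostname instead of the cluster-internal service address, and the ALB spreads requests across the Jaeger collector pods.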