I have been seeing periodic dropping of spans, correlating with spikes in in-queue and save latency. Save latency in particular is an indication that the storage backend isn't able to keep up. The spikes also correlate with an increase in flush operation timing, and the index write memory steadily climbs until the flush, at which point it drops and we see the span drops and latency spikes. I don't have much experience with Elasticsearch, so any help/guidance is appreciated!
I have attached some screenshots of our Jaeger and Elasticsearch dashboards.
The issue was not Elasticsearch, but rather the way we were routing traffic from our OpenTelemetry collector fleet to our Jaeger collector fleet. We were routing through the Kubernetes ClusterIP Service at jaeger-collector.jaeger.svc.cluster.local:14250. We noticed the traces were not distributed evenly across the Jaeger collectors, so we swapped in an Application Load Balancer with a target group using the gRPC protocol version. The root cause for us was the way gRPC works: each OpenTelemetry collector keeps a long-lived HTTP/2 connection open, and the Kubernetes Service only balances at connection-establishment time, so all of a collector's spans kept landing on the same Jaeger collector pod. After the swap we are no longer dropping spans, translog ops and size are tiny, and the index write memory is steady and flat.
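For anyone hitting the same symptom, here is a minimal Go sketch of the gRPC behaviour behind it. It is illustrative only: the headless-service name is hypothetical, and this is not the ALB setup we actually deployed.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Dialing the ClusterIP Service: gRPC opens a single long-lived HTTP/2
	// connection to the one Service IP and multiplexes every span batch over
	// it, so a single Jaeger collector pod ends up receiving all the traffic.
	pinned, err := grpc.Dial(
		"jaeger-collector.jaeger.svc.cluster.local:14250",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer pinned.Close()

	// Alternative to an external load balancer: point gRPC at a headless
	// Service (hypothetical name below) so DNS returns every pod IP, and
	// enable client-side round_robin so requests spread across the pods.
	balanced, err := grpc.Dial(
		"dns:///jaeger-collector-headless.jaeger.svc.cluster.local:14250",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer balanced.Close()
}
```

We went with the ALB and a gRPC target group rather than client-side balancing, but either approach avoids pinning every span stream to a single collector pod.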