Hi James,
> Bouncing the clients resolved the issue
Could you please tell us which version you upgraded to in order to resolve this
issue? That should also help other users who encounter the same issue.
The code snippet you listed has existed since 2018, so I don't think there is
any problem there.
Maybe there were bugs in other places that got fixed indirectly.
Thank you.
On Tue, Nov 23, 2021 at 10:27 AM James Olsen <ja...@inaseq.com> wrote:
We had a 2.5.1 Broker/Client system running for some time with regular rolling
OS upgrades to the Brokers without any problems. A while ago we upgraded both
Broker and Clients to 2.7.1 and now on the first rolling OS upgrade to the
2.7.1 Brokers we encountered some Consumer issues. We have a 3 Broker setup
with min-ISRs configured to avoid any outage.
So maybe we just got lucky 6 times in a row with the 2.5.1 or maybe there is an
issue with the 2.7.1.
The observable symptom is a continuous stream of "The coordinator is not
available" messages when trying to commit offsets. It starts with the usual
messages you might expect during a rolling upgrade...
2021-11-22 04:41:25,269 WARN
[org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
'pool-7-thread-132' [Consumer clientId=consumer-MyService-group-58,
groupId=MyService-group] Offset commit failed on partition MyTopic-0 at offset
866799313: The coordinator is loading and hence can't process requests.
... then 5 minutes of all OK, then ...
2021-11-22 04:46:33,258 WARN
[org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
'pool-7-thread-132' [Consumer clientId=consumer-MyService-group-58,
groupId=MyService-group] Offset commit failed on partition MyTopic-0 at offset
866803953: This is not the correct coordinator.
2021-11-22 04:46:33,258 INFO
[org.apache.kafka.clients.consumer.internals.AbstractCoordinator]
'pool-7-thread-132' [Consumer clientId=consumer-MyService-group-58,
groupId=MyService-group] Group coordinator b-2.xxx.com:9094 (id: 2147483645
rack: null) is unavailable or invalid due to cause: error response
NOT_COORDINATOR.isDisconnected: false. Rediscovery will be attempted.
2021-11-22 04:46:33,258 WARN [xxx.KafkaConsumerRunner] 'pool-7-thread-132'
Offset commit with offsets {MyTopic-0=OffsetAndMetadata{offset=866803953,
leaderEpoch=null, metadata=''}} failed:
org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset commit
failed with a retriable exception. You should retry committing the latest
consumed offsets.
Caused by: org.apache.kafka.common.errors.NotCoordinatorException: This is not
the correct coordinator.
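A side note on the coordinator id in the log above: as far as I know, the Java client derives the coordinator's connection id as Integer.MAX_VALUE minus the broker's node id, so 2147483645 would correspond to broker id 2, which is consistent with the b-2 host name. A minimal sketch of that arithmetic follows; CoordinatorIds and brokerIdFromCoordinatorId are hypothetical names for illustration, not part of kafka-clients.

```java
// Hypothetical helper for illustration only; not part of kafka-clients.
final class CoordinatorIds {
    private CoordinatorIds() {}

    // Assumption: the client computes the coordinator connection id as
    // Integer.MAX_VALUE - brokerNodeId, so the same subtraction inverts it.
    static int brokerIdFromCoordinatorId(int coordinatorConnectionId) {
        return Integer.MAX_VALUE - coordinatorConnectionId;
    }
}
```

Calling CoordinatorIds.brokerIdFromCoordinatorId(2147483645) gives 2 under that assumption.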
... then the following message for every subsequent attempt to commit offsets
2021-11-22 04:46:33,284 WARN [xxx.KafkaConsumerRunner] 'pool-7-thread-132'
Offset commit with offsets {MyTopic-0=OffsetAndMetadata{offset=866803954,
leaderEpoch=82, metadata=''}, MyOtherTopic-0=OffsetAndMetadata{offset=12654756,
leaderEpoch=79, metadata=''}} failed:
org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset commit
failed with a retriable exception. You should retry committing the latest
consumed offsets.
Caused by: org.apache.kafka.common.errors.CoordinatorNotAvailableException: The
coordinator is not available.
In the above example we are doing manual async commits, but we also had offset
commit failures for a different consumer group (observed through lag monitoring)
that uses auto-commit; it just didn't log the ongoing failures. In both cases
messages were still being processed; only the commits were failing.
These are our two busiest consumer groups and both have static Topic
assignments. Other consumer groups continued OK.
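Since the commits failed continuously while processing carried on, one way to surface this earlier than lag monitoring is to count consecutive commit failures. The following is a minimal sketch in plain Java under stated assumptions: CommitFailureTracker is a hypothetical helper (not part of kafka-clients), and the threshold is an arbitrary choice. With manual async commits, record(...) could be called from OffsetCommitCallback.onComplete with the exception argument, which is null on success.

```java
// Hypothetical helper for illustration only; not part of kafka-clients.
final class CommitFailureTracker {
    private final int threshold;
    private int consecutiveFailures;

    CommitFailureTracker(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    // Record the outcome of one commit attempt. A null error means success
    // and resets the run. Returns true once the run of consecutive failures
    // has reached the threshold, i.e. when the caller may want to alert.
    synchronized boolean record(Exception error) {
        if (error == null) {
            consecutiveFailures = 0;
            return false;
        }
        consecutiveFailures++;
        return consecutiveFailures >= threshold;
    }

    synchronized int consecutiveFailures() {
        return consecutiveFailures;
    }
}
```

What to do when record(...) returns true (log at error level, raise an alert, or recreate the consumer) is a design decision this sketch deliberately leaves open.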
I've spent some time examining the (Java) client code and started to wonder
whether there is a bug or race condition that means the coordinator never gets
reassigned after being invalidated and we simply keep hitting the following
short-circuit:
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator

    RequestFuture<Void> sendOffsetCommitRequest(final Map<TopicPartition, OffsetAndMetadata> offsets) {
        if (offsets.isEmpty())
            return RequestFuture.voidSuccess();

        Node coordinator = checkAndGetCoordinator();
        if (coordinator == null)
            return RequestFuture.coordinatorNotAvailable();
I'm not sure what the exact pathway is to getting the coordinator set, but I
note that
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(Timer)
and other methods that look like they may be related tend to log only at debug
level when they encounter a RetriableException, which could explain why I don't
have more detail to provide.
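To make the suspected failure mode concrete, here is a toy model in plain Java. It is not Kafka's code and makes no claim about where a real bug might be: ToyCoordinatorClient and all of its members are hypothetical names, and the rediscoveryWorks flag simply stands in for "something prevents the coordinator from being reassigned". It only shows how a null coordinator that is never replaced would produce the same error on every commit.

```java
// Toy model for illustration only; these names do not exist in kafka-clients.
final class ToyCoordinatorClient {
    // null models "coordinator unknown", e.g. after a NOT_COORDINATOR response.
    private String coordinator;
    // When false, rediscovery silently does nothing (the hypothesised fault).
    private final boolean rediscoveryWorks;

    ToyCoordinatorClient(String initialCoordinator, boolean rediscoveryWorks) {
        this.coordinator = initialCoordinator;
        this.rediscoveryWorks = rediscoveryWorks;
    }

    // Models the client invalidating its current coordinator.
    void markCoordinatorUnknown() {
        coordinator = null;
    }

    // Models one rediscovery attempt that would find newCoordinator.
    void rediscover(String newCoordinator) {
        if (rediscoveryWorks) {
            coordinator = newCoordinator;
        }
    }

    // Mirrors the shape of the short-circuit quoted above: with no
    // coordinator set, the commit fails immediately without a request.
    String commit() {
        if (coordinator == null) {
            return "COORDINATOR_NOT_AVAILABLE";
        }
        return "OK";
    }
}
```

In this model a client built with rediscoveryWorks set to true recovers after one rediscover(...) call, while one built with false keeps returning COORDINATOR_NOT_AVAILABLE until it is recreated, which is what bouncing a client would amount to.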
I'm not familiar enough with the code to be able to trace this through any
further, but if you've had the patience to keep reading this far then maybe you
Bouncing the clients resolved the issue, but I'd be interested if any experts
out there can identify if there is any weakness in the 2.7.1 version.
Regards, James.
Consumer failure after rolling Broker upgrade
James Olsen
Re: Consumer failure after rolling Broker upgrade
Luke Chen