(OpenShift 4.4.12) After recreating my cluster I get weird connection issues. At first I got some errors on the ZooKeeper pods about unknown_ca; I saw an open issue on that and solved it by uninstalling the cluster and the operator and reinstalling them.

Now I keep getting "Broker may not be available" on the Kafka Connect nodes, and I am really out of ideas about what is going on.

2020-07-22 07:33:18,553 INFO 10.129.4.1 - - [22/Jul/2020:07:33:18 +0000] "GET / HTTP/1.1" 200 91 1 (org.apache.kafka.connect.runtime.rest.RestServer) [qtp1865219266-23]
2020-07-22 07:33:19,682 WARN [Producer clientId=producer-2] Connection to node 0 (kafka-dev-kafka-0.kafka-dev-kafka-brokers.lagom.svc/10.128.4.27:9093) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient) [kafka-producer-network-thread | producer-2]
2020-07-22 07:33:19,682 WARN [Producer clientId=producer-3] Connection to node 0 (kafka-dev-kafka-0.kafka-dev-kafka-brokers.lagom.svc/10.128.4.27:9093) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient) [kafka-producer-network-thread | producer-3]

I attached the logs of the brokers, the operators and Kafka Connect.

logs.zip

TBH, I'm not sure what could be the problem. From the log my guess would be some network issue.

  • Can you share the full logs, i.e. from the beginning when the containers start? The configuration is printed at startup, so having the full logs would help us understand what the config is and see which parts did and didn't work. (A sketch of commands to collect them is below the list.)
  • Can you share the custom resources for Kafka, Connect and any connector configurations which you might already have in Connect? (Or have you not deployed any connectors yet?)
  • Does the Kafka cluster work fine from other Kafka applications? Is the problem just Connect or all Kafka clients?
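For reference, the full logs and custom resources can be collected with something like the commands below. The resource names are the ones from this thread and the operator deployment name is the usual Strimzi default, so adjust them to your cluster:

    # full logs since container start (pod/deployment names are examples)
    oc logs kafka-dev-kafka-0 -c kafka > kafka-0.log
    oc logs deployment/strimzi-cluster-operator > operator.log
    # the Strimzi custom resources
    oc get kafka kafka-dev -o yaml > kafka-cr.yaml
    oc get kafkaconnect -o yaml > kafkaconnect-cr.yaml
    oc get kafkaconnector -o yaml > connectors.yaml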
    It is probably a weird network issue, but I don't really understand why. One of the replicas works as expected for now; the other one I just restarted to get the logs from the beginning, and it cannot communicate with broker 2. It is pure luck whether they work after I restart them.

    We use Lenses.io to create connectors, and it is able to communicate with the 3 brokers without any issues.

    I added everything in the zip
    crds-logs.zip

    It is probably a weird network issue, but I don't really understand why. One of the replicas works as expected for now; the other one I just restarted to get the logs from the beginning, and it cannot communicate with broker 2. It is pure luck whether they work after I restart them.

    Have you checked whether they run on the same worker node, and whether one of them works while the other doesn't? Or maybe it is one specific worker node of your cluster which does not work?

    Some things I noticed from the logs:

    You seem to use Kafka 2.5.0 for the broker, and in the Connect custom resource you also tell the operator that you want to run 2.5.0. But from the log, it seems like the container image you provided is actually running Kafka 2.1.0:

    2020-07-22 21:03:14,072 INFO Kafka version : 2.1.0 (org.apache.kafka.common.utils.AppInfoParser) [DistributedHerder]
    2020-07-22 21:03:14,072 INFO Kafka commitId : 809be928f1ae004e (org.apache.kafka.common.utils.AppInfoParser) [DistributedHerder]
    

    This seems to be confirmed by the part of the log listing the classpath artifacts, which are also all 2.1.0.
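    If you want to double-check which Kafka version a running Connect pod actually contains, you can look at the jars on its classpath. This is just a sketch assuming the standard Strimzi image layout under /opt/kafka; the pod name is the one from the log file attached later in this thread:

    # list the Kafka client jars bundled in the Connect image
    oc exec connect-connect-6885c475cd-t7xmg -- ls /opt/kafka/libs/ | grep kafka-clients
    # kafka-clients-2.5.0.jar would confirm the expected version, kafka-clients-2.1.0.jar the mismatch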

    To avoid any issues, you definitely need to rebuild the image from the Kafka 2.5.0 base image of the corresponding Strimzi version. Using an old image can cause problems both because of Kafka incompatibilities and because of misconfiguration of Connect, since the operator relies on the proper versions of the helper scripts inside the image. So you should look at it and rebuild it with the right base image.
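    For reference, a minimal rebuild usually starts from the Strimzi Kafka image whose tag matches both your operator release and the kafka.version in your custom resource. The tag, registry and plugin directory below are only examples, so substitute the ones for your setup:

    # Dockerfile (example tag: <strimzi-version>-kafka-<kafka-version>)
    FROM strimzi/kafka:0.19.0-kafka-2.5.0
    USER root:root
    COPY ./my-plugins/ /opt/kafka/plugins/
    USER 1001

    # then build and push, and point the KafkaConnect image field at the result
    docker build -t my-registry.example.com/kafka-connect:2.5.0 .
    docker push my-registry.example.com/kafka-connect:2.5.0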

    If you think you built it properly with the correct versions, could it be that the replicas which do not work have some stale version of the image while the running replicas have the proper one?
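    One way to check for a stale image is to compare the image digest each Connect replica is actually running; this is only a sketch, filtering by pod name so it works without knowing your exact labels:

    # print pod name and the resolved image ID for every pod, then filter for Connect
    oc get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].imageID}{"\n"}{end}' | grep connect
    # identical digests mean all replicas run the same build; differing digests point to a stale image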

    It seems weird that none of your logs so far show problems with connecting to node -1, which is used for the initial metadata request. They always seem to complain about one of the broker nodes. So I wonder if you have some problem with the routing to the headless service, which is not used for the initial connection to -1.

    Maybe you can try to debug this and compare whether you can, for example, telnet to kafka-dev-kafka-2.kafka-dev-kafka-brokers.lagom.svc:9093 versus to kafka-dev-kafka-bootstrap.lagom.svc:9093.
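    If telnet is not available inside the Connect container, bash's /dev/tcp pseudo-device can do the same check. This is just a sketch; the pod name is the one from the log file attached later in this thread:

    # test the per-broker (headless service) address
    oc exec connect-connect-6885c475cd-t7xmg -- bash -c \
      'timeout 5 bash -c "</dev/tcp/kafka-dev-kafka-2.kafka-dev-kafka-brokers.lagom.svc/9093" && echo broker-2 reachable || echo broker-2 NOT reachable'
    # test the bootstrap service address
    oc exec connect-connect-6885c475cd-t7xmg -- bash -c \
      'timeout 5 bash -c "</dev/tcp/kafka-dev-kafka-bootstrap.lagom.svc/9093" && echo bootstrap reachable || echo bootstrap NOT reachable'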

    You can also check whether the IP addresses in the logs, listed behind the service name, correspond to the IP addresses of the Kafka broker pods. If not, it would suggest some DNS issue.
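    Something like the following compares what the Connect pod resolves against the actual broker pod IP (just a sketch; getent is usually available in the image, and the pod name is again the one from this thread):

    # what the Connect pod resolves the broker address to
    oc exec connect-connect-6885c475cd-t7xmg -- getent hosts kafka-dev-kafka-0.kafka-dev-kafka-brokers.lagom.svc
    # the IP the broker pod actually has
    oc get pod kafka-dev-kafka-0 -o jsonpath='{.status.podIP}{"\n"}'
    # the two addresses should match; a mismatch points at stale or wrong DNS records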

    Thank you for your help. I thought the latest Kafka Connect image was 2.1 because of this: https://hub.docker.com/r/strimzi/kafka-connect. I have now built the image with the correct base image.

    As for the rest, I cannot telnet to any of them, and when I check the IPs they do not correspond to the pods:
    "Connection to node 0 (kafka-dev-kafka-0.kafka-dev-kafka-brokers.lagom.svc/10.128.4.27:9093)"

    oc get pods -o wide | grep kafka-dev
    kafka-dev-entity-operator-66984d576f-l87ht 3/3 Running 0 20h 10.128.4.117 ip-10-0-168-72.eu-west-1.compute.internal
    kafka-dev-kafka-0 2/2 Running 0 20h 10.131.6.88 ip-10-0-138-180.eu-west-1.compute.internal
    kafka-dev-kafka-1 2/2 Running 0 20h 10.130.2.239 ip-10-0-132-230.eu-west-1.compute.internal
    kafka-dev-kafka-2 2/2 Running 0 20h 10.128.6.62 ip-10-0-174-126.eu-west-1.compute.internal
    kafka-dev-kafka-jmx-trans-754bc55f47-hb99d 1/1 Running 0 17h 10.129.5.45 ip-10-0-156-74.eu-west-1.compute.internal
    kafka-dev-zookeeper-0 1/1 Running 1 20h 10.130.2.238 ip-10-0-132-230.eu-west-1.compute.internal
    kafka-dev-zookeeper-1 1/1 Running 0 20h 10.129.5.40 ip-10-0-156-74.eu-west-1.compute.internal
    kafka-dev-zookeeper-2 1/1 Running 1 20h 10.128.4.116 ip-10-0-168-72.eu-west-1.compute.internal

    So how should I proceed with troubleshooting the DNS issue? Where does the DNS of OpenShift actually sit, and what could I do to propagate the real DNS names / IPs? Could this be an issue with OpenShift 4.4.12? I didn't face such issues before the upgrade.
    connect-connect-6885c475cd-t7xmg-connect-connect.log

    Ok, if the IPs don't correspond to the pods, then it does look like some DNS issue where the names are not resolving properly. But I'm afraid I know nothing about how DNS is handled in OpenShift, so I cannot help much with fixing that. If some replicas work and some don't, maybe the DNS issue is limited to only some worker nodes, and just restarting those nodes might help?
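    A rough way to narrow it down is to compare which worker nodes the working and failing Connect replicas run on with the state of the DNS pods on those nodes. The openshift-dns namespace, dns-default daemon set and dns container name below are the OpenShift 4 defaults, so adjust them if your cluster differs:

    # which worker node does each Connect replica run on?
    oc get pods -o wide | grep connect
    # is the cluster DNS operator healthy, and are the per-node DNS pods running?
    oc get clusteroperator dns
    oc get pods -n openshift-dns -o wide
    # logs of the DNS pod on a suspect node (pod name is an example)
    oc logs -n openshift-dns dns-default-xxxxx -c dns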