(OpenShift 4.4.12) After recreating my cluster I get weird connection issues. At first I got some errors on the ZooKeeper pods about unknown_ca; I saw an open issue on that and solved it by uninstalling the cluster and the operator and reinstalling them.

Now I keep getting "Broker may not be available" on the Kafka Connect nodes, and I am really out of ideas about what is going on.

2020-07-22 07:33:18,553 INFO 10.129.4.1 - - [22/Jul/2020:07:33:18 +0000] "GET / HTTP/1.1" 200 91 1 (org.apache.kafka.connect.runtime.rest.RestServer) [qtp1865219266-23]
2020-07-22 07:33:19,682 WARN [Producer clientId=producer-2] Connection to node 0 (kafka-dev-kafka-0.kafka-dev-kafka-brokers.lagom.svc/10.128.4.27:9093) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient) [kafka-producer-network-thread | producer-2]
2020-07-22 07:33:19,682 WARN [Producer clientId=producer-3] Connection to node 0 (kafka-dev-kafka-0.kafka-dev-kafka-brokers.lagom.svc/10.128.4.27:9093) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient) [kafka-producer-network-thread | producer-3]

I attached the logs of the brokers, the operators and Kafka Connect.

logs.zip

TBH, I'm not sure what could be the problem. From the log my guess would be some network issue.

  • Can you share the full logs, i.e. from the beginning when the containers start? The configuration is printed at startup, so having the full logs would help us understand what the config is and see which parts did and didn't work. (A sketch of commands to collect them is below the list.)
  • Can you share the custom resources for Kafka, Connect and any connector configurations which you might already have in Connect? (Or have you not deployed any connectors yet?)
  • Does the Kafka cluster work fine from other Kafka applications? Is the problem just Connect or all Kafka clients?
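For reference, the full logs and custom resources can be collected with something like the commands below. The resource names are the ones from this thread and the operator deployment name is the usual Strimzi default, so adjust them to your cluster:

    # full logs since container start (pod/deployment names are examples)
    oc logs kafka-dev-kafka-0 -c kafka > kafka-0.log
    oc logs deployment/strimzi-cluster-operator > operator.log
    # the Strimzi custom resources
    oc get kafka kafka-dev -o yaml > kafka-cr.yaml
    oc get kafkaconnect -o yaml > kafkaconnect-cr.yaml
    oc get kafkaconnector -o yaml > connectors.yaml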
    It is probably a weird network issue, but I don't really understand why. One of the replicas works as expected for now; the other one I just restarted to get the logs from the beginning, and it cannot communicate with broker 2. It is pure luck whether they work after I restart them.

    We use Lenses.io to create connectors, and it is able to communicate with the 3 brokers without any issues.

    I added everything in the zip
    crds-logs.zip

    It is probably a weird network issue, but I don't really understand why. One of the replicas works as expected for now; the other one I just restarted to get the logs from the beginning, and it cannot communicate with broker 2. It is pure luck whether they work after I restart them.

    Have you checked whether they run on the same worker node, and whether one of them works while the other doesn't? Or maybe it is one specific worker node of your cluster which does not work?

    Some things I noticed from the logs:

    You seem to use Kafka 2.5.0 for the broker, and in the Connect custom resource you also tell the operator that you want to run 2.5.0. But from the log, it seems like the container image you provided is actually running Kafka 2.1.0:

    2020-07-22 21:03:14,072 INFO Kafka version : 2.1.0 (org.apache.kafka.common.utils.AppInfoParser) [DistributedHerder]
    2020-07-22 21:03:14,072 INFO Kafka commitId : 809be928f1ae004e (org.apache.kafka.common.utils.AppInfoParser) [DistributedHerder]
    

    This seems to be confirmed by the part of the log listing the classpath artifacts, which are also all 2.1.0.
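    If you want to double-check which Kafka version a running Connect pod actually contains, you can look at the jars on its classpath. This is just a sketch assuming the standard Strimzi image layout under /opt/kafka; the pod name is the one from the log file attached later in this thread:

    # list the Kafka client jars bundled in the Connect image
    oc exec connect-connect-6885c475cd-t7xmg -- ls /opt/kafka/libs/ | grep kafka-clients
    # kafka-clients-2.5.0.jar would confirm the expected version, kafka-clients-2.1.0.jar the mismatch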

    To avoid any issues, you definitely need to rebuild the image from the Kafka 2.5.0 base image of the corresponding Strimzi version. Using an old image can cause problems both because of Kafka incompatibilities and because of misconfiguration of Connect, since the operator relies on the proper versions of the helper scripts inside the image. So you should look at it and rebuild it with the right base image.
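    For reference, a minimal rebuild usually starts from the Strimzi Kafka image whose tag matches both your operator release and the kafka.version in your custom resource. The tag, registry and plugin directory below are only examples, so substitute the ones for your setup:

    # Dockerfile (example tag: <strimzi-version>-kafka-<kafka-version>)
    FROM strimzi/kafka:0.19.0-kafka-2.5.0
    USER root:root
    COPY ./my-plugins/ /opt/kafka/plugins/
    USER 1001

    # then build and push, and point the KafkaConnect image field at the result
    docker build -t my-registry.example.com/kafka-connect:2.5.0 .
    docker push my-registry.example.com/kafka-connect:2.5.0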

    If you think you built it properly with the correct versions, could it be that the replicas which do not work have some stale version of the image while the running replicas have the proper one?
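    One way to check for a stale image is to compare the image digest each Connect replica is actually running; this is only a sketch, filtering by pod name so it works without knowing your exact labels:

    # print pod name and the resolved image ID for every pod, then filter for Connect
    oc get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].imageID}{"\n"}{end}' | grep connect
    # identical digests mean all replicas run the same build; differing digests point to a stale image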

    It seems weird that none of your logs so far show problems with connecting to node -1, which is used for the initial metadata request. They always seem to complain about one of the broker nodes. So I wonder if you have some problem with the routing to the headless service, which is not used for the initial connection to -1.

    Maybe you can try to debug this and compare whether you can, for example, telnet to kafka-dev-kafka-2.kafka-dev-kafka-brokers.lagom.svc:9093 versus to kafka-dev-kafka-bootstrap.lagom.svc:9093.
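    If telnet is not available inside the Connect container, bash's /dev/tcp pseudo-device can do the same check. This is just a sketch; the pod name is the one from the log file attached later in this thread:

    # test the per-broker (headless service) address
    oc exec connect-connect-6885c475cd-t7xmg -- bash -c \
      'timeout 5 bash -c "</dev/tcp/kafka-dev-kafka-2.kafka-dev-kafka-brokers.lagom.svc/9093" && echo broker-2 reachable || echo broker-2 NOT reachable'
    # test the bootstrap service address
    oc exec connect-connect-6885c475cd-t7xmg -- bash -c \
      'timeout 5 bash -c "</dev/tcp/kafka-dev-kafka-bootstrap.lagom.svc/9093" && echo bootstrap reachable || echo bootstrap NOT reachable'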

    You can also check whether the IP addresses in the logs, listed behind the service name, correspond to the IP addresses of the Kafka broker pods. If not, it would suggest some DNS issue.
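    Something like the following compares what the Connect pod resolves against the actual broker pod IP (just a sketch; getent is usually available in the image, and the pod name is again the one from this thread):

    # what the Connect pod resolves the broker address to
    oc exec connect-connect-6885c475cd-t7xmg -- getent hosts kafka-dev-kafka-0.kafka-dev-kafka-brokers.lagom.svc
    # the IP the broker pod actually has
    oc get pod kafka-dev-kafka-0 -o jsonpath='{.status.podIP}{"\n"}'
    # the two addresses should match; a mismatch points at stale or wrong DNS records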

    Thank you for your help. I thought the latest Kafka Connect image was 2.1 because of this: https://hub.docker.com/r/strimzi/kafka-connect. I have now built the image with the correct base image.

    As for the rest, I cannot telnet to any of them, and when I check the IPs they do not correspond to the pods:
    "Connection to node 0 (kafka-dev-kafka-0.kafka-dev-kafka-brokers.lagom.svc/10.128.4.27:9093)"

    oc get pods -o wide | grep kafka-dev
    kafka-dev-entity-operator-66984d576f-l87ht 3/3 Running 0 20h 10.128.4.117 ip-10-0-168-72.eu-west-1.compute.internal
    kafka-dev-kafka-0 2/2 Running 0 20h 10.131.6.88 ip-10-0-138-180.eu-west-1.compute.internal
    kafka-dev-kafka-1 2/2 Running 0 20h 10.130.2.239 ip-10-0-132-230.eu-west-1.compute.internal
    kafka-dev-kafka-2 2/2 Running 0 20h 10.128.6.62 ip-10-0-174-126.eu-west-1.compute.internal
    kafka-dev-kafka-jmx-trans-754bc55f47-hb99d 1/1 Running 0 17h 10.129.5.45 ip-10-0-156-74.eu-west-1.compute.internal
    kafka-dev-zookeeper-0 1/1 Running 1 20h 10.130.2.238 ip-10-0-132-230.eu-west-1.compute.internal
    kafka-dev-zookeeper-1 1/1 Running 0 20h 10.129.5.40 ip-10-0-156-74.eu-west-1.compute.internal
    kafka-dev-zookeeper-2 1/1 Running 1 20h 10.128.4.116 ip-10-0-168-72.eu-west-1.compute.internal

    So how should I proceed with troubleshooting the DNS issue? Where does the DNS of OpenShift actually sit, and what could I do to propagate the real DNS names / IPs? Could this be an issue with OpenShift 4.4.12? I didn't face such issues before the upgrade.
    connect-connect-6885c475cd-t7xmg-connect-connect.log

    Ok, if the IPs don't correspond to the pods, then it does look like some DNS issue where the names are not resolving properly. But I'm afraid I know nothing about how DNS is handled in OpenShift, so I cannot help much with fixing that. If some replicas work and some don't, maybe the DNS issue is limited to only some worker nodes, and just restarting those nodes might help?
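    A rough way to narrow it down is to compare which worker nodes the working and failing Connect replicas run on with the state of the DNS pods on those nodes. The openshift-dns namespace, dns-default daemon set and dns container name below are the OpenShift 4 defaults, so adjust them if your cluster differs:

    # which worker node does each Connect replica run on?
    oc get pods -o wide | grep connect
    # is the cluster DNS operator healthy, and are the per-node DNS pods running?
    oc get clusteroperator dns
    oc get pods -n openshift-dns -o wide
    # logs of the DNS pod on a suspect node (pod name is an example)
    oc logs -n openshift-dns dns-default-xxxxx -c dns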