[BUG] all-roles K3s cluster hits Failed to communicate with API server error, intermittently when scaling up/down, after a manual snapshot is taken #43329
Josh-Diamond opened this issue Oct 30, 2023 · 4 comments
Labels: kind/bug, status/release-blocker, team/hostbusters
  • Kubernetes version: v1.27.6+k3s1
  • Cluster Type (Local/Downstream): All-roles Downstream AWS Node driver
  • User Information
  • What is the role of the user logged in? Admin
  • Describe the bug
    Intermittently, when scaling all-roles k3s Node driver clusters, the cluster will enter an error state w/ the following message: cluster health check failed: Failed to communicate with API server during namespace check: apiserver not ready

    To Reproduce

  • Fresh install of Rancher v2.8.0-rc3
  • Provision a single-node downstream AWS Node driver cluster w/ all-roles + k8s v1.27.6+k3s1
  • Once active, manually take a snapshot
  • Once captured, scale the cluster up to 3 total nodes, scaling up 1 at a time
  • Once scaled up, delete the init node
  • Once init node has been removed, scale the cluster back down to 1 node
  • Reproduced
  • Result
    Cluster stuck in Error state w/ the following error seen on one of the machines: Error applying plan -- check rancher-system-agent.service logs on node for more information

    Expected Result
    Cluster expected to successfully scale back down to 1 node

    Screenshots

    Additional context
    Empty logs on the node (machine?) in the error state:

    -- Logs begin at Mon 2023-10-30 17:34:13 UTC. --
    
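    For reference, assuming SSH access to the failed machine, the node-side logs can be pulled with something along these lines (the rancher-system-agent.service unit name comes from the error message above, and k3s server nodes run under the k3s.service unit):

    # Inspect the rancher-system-agent service named in the "Error applying plan" message
    journalctl -u rancher-system-agent.service --no-pager

    # The k3s service logs are also worth checking for the failing probes
    # (kube-apiserver, kube-controller-manager, kube-scheduler, kubelet)
    journalctl -u k3s.service --no-pager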

    Rancher Server logs

    2023/10/30 18:20:27 [ERROR] error syncing 'c-m-<REDACTED>': handler cluster-deploy: apiserver not ready, requeuing
    2023/10/30 18:20:27 [INFO] [planner] rkecluster fleet-default/jkeslar-k3s-new: configuring bootstrap node(s) jkeslar-k3s-new-pool1-<REDACTED>: error applying plan -- check rancher-system-agent.service logs on node for more information, waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
    2023/10/30 18:20:38 [ERROR] error syncing 'fleet-default/jkeslar-k3s-new-bootstrap-template-<REDACTED>': handler rke-bootstrap-cluster-name: apiserver not ready, requeuing
    W1030 18:21:39.268748      38 warnings.go:80] cluster.x-k8s.io/v1alpha3 MachineSet is deprecated; use cluster.x-k8s.io/v1beta1 MachineSet
    W1030 18:22:23.206043      38 warnings.go:80] cluster.x-k8s.io/v1alpha3 Machine is deprecated; use cluster.x-k8s.io/v1beta1 Machine
    2023/10/30 18:22:27 [ERROR] error syncing 'c-m-<REDACTED>': handler cluster-deploy: apiserver not ready, requeuing
    2023/10/30 18:22:27 [INFO] [planner] rkecluster fleet-default/jkeslar-k3s-new: configuring bootstrap node(s) jkeslar-k3s-new-pool1-<REDACTED>: error applying plan -- check rancher-system-agent.service logs on node for more information, waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
    2023/10/30 18:22:38 [ERROR] error syncing 'fleet-default/jkeslar-k3s-new-bootstrap-template-<REDACTED>': handler rke-bootstrap-cluster-name: apiserver not ready, requeuing
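    (For completeness: assuming a Helm-installed Rancher running in the cattle-system namespace of the local cluster, the server log excerpt above can be re-collected with something like the following.)

    kubectl -n cattle-system logs deploy/rancher -f --tail=100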

    This is currently believed to be caused by #43230, an intended fix for #43097. That change moved the etcd safe member removal to before the node is drained, so the node stops participating as an etcd member but remains a node in the cluster. However, since k3s does not run etcd as a static pod, removing the node from etcd prior to draining appears to cause a complete node failure and prevents the safe node removal from ever completing correctly.
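    A quick way to confirm this from a surviving server node is to list the embedded etcd members directly. This is only a sketch -- etcdctl is not shipped with k3s and has to be installed separately; the cert paths below are the standard k3s embedded-etcd locations:

    # List current members of the k3s embedded etcd (run on a server node)
    ETCDCTL_API=3 etcdctl member list -w table \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
      --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
      --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key

    If the machine being removed has already disappeared from this list while its node object is still present and not yet drained, that matches the failure mode described above.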

    With revert PR #43330 merged, this will be available to be tested on v2.8-head once https://drone-publish.rancher.io/rancher/rancher/10922 (or later build) passes. This issue can be moved "To Test" after that.
    Edit: Rerunning CI - https://drone-publish.rancher.io/rancher/rancher/10925/1/1

    @Josh-Diamond, this is ready to test now since the build passed. If the issue is no longer reproducible, please close it and also remove the milestone: technically this issue was not present in any released version, so it doesn't seem right to close it with the v2.8.0 milestone when it wasn't "fixed" in that milestone. FYI @daviswill2 @Jono-SUSE-Rancher

    The following scenario was successfully executed on 5 clusters:

  • Fresh install of Rancher v2.8-head
  • Provision a single-node all-roles downstream K3s AWS Node driver cluster w/ k8s v1.27.7+k3s1
  • Once active, take a manual snapshot
  • Once captured, scale up cluster by 1
  • Verified - cluster successfully scales up; total nodes now 2
  • Scale up the cluster by 1, once more
  • Verified - cluster successfully scales up; total nodes now 3
  • Scale down the cluster by 1
  • Verified - cluster successfully scales down; total nodes now 2
  • Scale down cluster by 1, once more
  • Verified - cluster successfully scales down; total nodes now 1, as expected