[BUG] all-roles K3s cluster hits Failed to communicate with API server error, intermittently when scaling up/down, after a manual snapshot is taken #43329
Josh-Diamond opened this issue Oct 30, 2023 · 4 comments
Labels: kind/bug, status/release-blocker, team/hostbusters
  • Kubernetes version: v1.27.6+k3s1
  • Cluster Type (Local/Downstream): All-roles Downstream AWS Node driver
  • User Information
  • What is the role of the user logged in? Admin
  • Describe the bug
    Intermittently, when scaling all-roles k3s Node driver clusters, the cluster will enter an error state w/ the following message: cluster health check failed: Failed to communicate with API server during namespace check: apiserver not ready

    To Reproduce

  • Fresh install of Rancher v2.8.0-rc3
  • Provision a single-node downstream AWS Node driver cluster w/ all-roles + k8s v1.27.6+k3s1
  • Once active, manually take a snapshot
  • Once captured, scale the cluster up to 3 total nodes, scaling up 1 at a time
  • Once scaled up, delete the init node
  • Once init node has been removed, scale the cluster back down to 1 node
  • Reproduced
  • Result
    Cluster stuck in Error state w/ the following error seen on one of the machines: Error applying plan -- check rancher-system-agent.service logs on node for more information

    Expected Result
    Cluster expected to successfully scale back down to 1 node

    Screenshots

    Additional context
    Empty logs on the node (machine?) in the error state:

    -- Logs begin at Mon 2023-10-30 17:34:13 UTC. --
    
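    For reference, assuming SSH access to the failed machine, the node-side logs can be pulled with something along these lines (the rancher-system-agent.service unit name comes from the error message above, and k3s server nodes run under the k3s.service unit):

    # Inspect the rancher-system-agent service named in the "Error applying plan" message
    journalctl -u rancher-system-agent.service --no-pager

    # The k3s service logs are also worth checking for the failing probes
    # (kube-apiserver, kube-controller-manager, kube-scheduler, kubelet)
    journalctl -u k3s.service --no-pager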

    Rancher Server logs

    2023/10/30 18:20:27 [ERROR] error syncing 'c-m-<REDACTED>': handler cluster-deploy: apiserver not ready, requeuing
    2023/10/30 18:20:27 [INFO] [planner] rkecluster fleet-default/jkeslar-k3s-new: configuring bootstrap node(s) jkeslar-k3s-new-pool1-<REDACTED>: error applying plan -- check rancher-system-agent.service logs on node for more information, waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
    2023/10/30 18:20:38 [ERROR] error syncing 'fleet-default/jkeslar-k3s-new-bootstrap-template-<REDACTED>': handler rke-bootstrap-cluster-name: apiserver not ready, requeuing
    W1030 18:21:39.268748      38 warnings.go:80] cluster.x-k8s.io/v1alpha3 MachineSet is deprecated; use cluster.x-k8s.io/v1beta1 MachineSet
    W1030 18:22:23.206043      38 warnings.go:80] cluster.x-k8s.io/v1alpha3 Machine is deprecated; use cluster.x-k8s.io/v1beta1 Machine
    2023/10/30 18:22:27 [ERROR] error syncing 'c-m-<REDACTED>': handler cluster-deploy: apiserver not ready, requeuing
    2023/10/30 18:22:27 [INFO] [planner] rkecluster fleet-default/jkeslar-k3s-new: configuring bootstrap node(s) jkeslar-k3s-new-pool1-<REDACTED>: error applying plan -- check rancher-system-agent.service logs on node for more information, waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet
    2023/10/30 18:22:38 [ERROR] error syncing 'fleet-default/jkeslar-k3s-new-bootstrap-template-<REDACTED>': handler rke-bootstrap-cluster-name: apiserver not ready, requeuing
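    (For completeness: assuming a Helm-installed Rancher running in the cattle-system namespace of the local cluster, the server log excerpt above can be re-collected with something like the following.)

    kubectl -n cattle-system logs deploy/rancher -f --tail=100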

    This is currently believed to be caused by #43230, an intended fix for #43097. That change moved the etcd safe member removal to before the node is drained, so the node stops participating as an etcd member but remains a node in the cluster. However, since k3s does not run etcd as a static pod, removing the node from etcd prior to draining appears to cause a complete node failure and prevents the safe node removal from ever completing correctly.
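    A quick way to confirm this from a surviving server node is to list the embedded etcd members directly. This is only a sketch -- etcdctl is not shipped with k3s and has to be installed separately; the cert paths below are the standard k3s embedded-etcd locations:

    # List current members of the k3s embedded etcd (run on a server node)
    ETCDCTL_API=3 etcdctl member list -w table \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
      --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
      --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key

    If the machine being removed has already disappeared from this list while its node object is still present and not yet drained, that matches the failure mode described above.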

    With revert PR #43330 merged, this will be available to be tested on v2.8-head once https://drone-publish.rancher.io/rancher/rancher/10922 (or later build) passes. This issue can be moved "To Test" after that.
    Edit: Rerunning CI - https://drone-publish.rancher.io/rancher/rancher/10925/1/1

    @Josh-Diamond, this is ready to test now since the build passed. If the issue is no longer reproducible, please close it and also remove the milestone: technically this issue was not present in any released version, so it doesn't seem right to close it with the v2.8.0 milestone when it wasn't "fixed" in that milestone. FYI @daviswill2 @Jono-SUSE-Rancher

    The following scenario was successfully executed on 5 clusters:

  • Fresh install of Rancher v2.8-head
  • Provision a single-node all-roles downstream K3s AWS Node driver cluster w/ k8s v1.27.7+k3s1
  • Once active, take a manual snapshot
  • Once captured, scale up cluster by 1
  • Verified - cluster successfully scales up; total nodes now 2
  • Scale up the cluster by 1, once more
  • Verified - cluster successfully scales up; total nodes now 3
  • Scale down the cluster by 1
  • Verified - cluster successfully scales down; total nodes now 2
  • Scale down cluster by 1, once more
  • Verified - cluster successfully scales down; total nodes now 1, as expected