  • Rancher version: rancher/rancher:v2.6-head 607fcdd
  • Installation option (Docker install/Helm Chart): helm
  • If Helm Chart, Kubernetes Info:
    Cluster Type (RKE1, RKE2, k3s, EKS, etc): RKE (local), DO RKE2 (downstream)
    Version: v1.20.9 (local), v1.21.4+rke2r1 (ds)
    Node Setup: single-node custom RKE (local) + DO RKE2 (downstream) containing:
  • master1 pool - 1 node with etcd and cp roles
  • master2 pool - 2 nodes with etcd and cp roles
  • worker1 pool - 1 node with worker role
  • worker2 pool - 2 nodes with worker role
  • Describe the bug
    Found when testing rancher/dashboard#3834

    When a user has two pools of cp+etcd nodes (plus extra worker pools) and deletes one of them, the cluster should become Active again after a while. Instead, the cluster becomes unreachable.

    To Reproduce

  • deploy a local Rancher cluster
  • provision a DO RKE2 cluster with the 4 pools described above, using default values
  • wait until the RKE2 cluster is Ready and Active
  • go to Cluster Management and do Edit Config on the DO RKE2 cluster
  • delete the master2 pool (containing 2 nodes, both with cp + etcd roles) by selecting the pool and pressing the minus sign, then press the Save button.
  • Result
    The RKE2 cluster will end up in an Error state and becomes unreachable from the dashboard, with the following error message visible:
    Cluster health check failed: Failed to communicate with API server during namespace check: Get "https://10.43.0.1:443/api/v1/namespaces/kube-system?timeout=45s": dial tcp 10.43.0.1:443: connect: connection refused
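    For reference, the probe that fails here is a plain GET of the kube-system namespace against the in-cluster API server address (10.43.0.1:443 is the default kubernetes service ClusterIP). Below is a minimal client-go sketch of an equivalent check; it only illustrates the failing request and is not Rancher's actual health-check code, with the 45-second timeout mirroring the ?timeout=45s in the URL.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs inside the downstream cluster, so the in-cluster
	// config points at the same service address as the error message.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 45*time.Second)
	defer cancel()

	// Equivalent of GET /api/v1/namespaces/kube-system?timeout=45s.
	if _, err := client.CoreV1().Namespaces().Get(ctx, "kube-system", metav1.GetOptions{}); err != nil {
		fmt.Println("namespace check failed:", err) // e.g. "connection refused" once the API server is down
		return
	}
	fmt.Println("API server reachable, kube-system namespace found")
}
```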

    Expected Result
    The cluster should become Active again after a while.

    Additional context

  • the same approach worked fine when the user deleted a pool with worker nodes, since another pool with one worker still remains. The cluster's state turned to Provisioning and then back to Active after the worker2 pool was removed.
  • the nodes from the deleted pool(s) are getting deleted from DO Cloud correctly
  • the same issue reproduced on AWS EC2 RKE2 cluster as well
  • Screenshot
    This is how it looked before the cluster became unreachable (the worker2 pool with the extra worker nodes had already been deleted successfully):

    Looking at this with @kinarashah, we believe that what is happening is that etcd is losing quorum, as the deleted members are not being cleaned out of the etcd cluster. It appears that the logic for removing etcd members does not get triggered all the time, and the problem is specifically exacerbated in the machine-pool case because the underlying VMs are deleted before the routine has a chance to run. With one etcd member in master1 and two in master2, the cluster has three members and needs two for quorum, so deleting the master2 pool leaves a single member that can never reach a majority.
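    To make the missing step concrete, the sketch below shows the kind of etcd member cleanup that would need to run, while quorum still exists, when the machines in a pool go away: list the members and remove the ones whose machines were deleted. The endpoint, TLS handling, and the staleMemberNames set are illustrative placeholders, not Rancher's actual implementation.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"}, // placeholder endpoint; real code needs TLS certs
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Hypothetical input: names of members whose underlying VMs were deleted
	// along with the machine pool.
	staleMemberNames := map[string]bool{"master2-node-1": true, "master2-node-2": true}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	resp, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range resp.Members {
		if !staleMemberNames[m.Name] {
			continue
		}
		// MemberRemove is itself a raft proposal, so it must happen while the
		// cluster still has quorum, i.e. before (or as) the VMs are destroyed.
		if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
			log.Printf("failed to remove member %s: %v", m.Name, err)
			continue
		}
		log.Printf("removed stale etcd member %s", m.Name)
	}
}
```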

    The issue title was changed on Aug 27, 2021 from "Failed to communicate with API server when one of the control-plane pools is deleted on RKE2 cluster" to "Failed to communicate with API server when one of the control-plane/etcd pools is deleted on RKE2 cluster", and then to "Failed to communicate with API server when etcd role node is removed".
    Reproduction Environment

  • Create Digital Ocean RKE2 downstream cluster with the following
  • main1 pool - 1 node with etcd and cp roles
  • main2 pool - 2 nodes with etcd and cp roles
  • work1 pool - 1 node with worker role
  • work2 pool - 2 nodes with worker role
  • Wait until the RKE2 cluster has a state of Active
  • Edit Config on the RKE2 Digital Ocean cluster
  • Delete the main2 pool which contains 2 nodes both with cp + etcd roles
  • Press the Save button
  • Result

    The RKE2 cluster will eventually go into an error state with the following warning. See the attached screenshot below, labeled A1.

    Cluster health check failed: Failed to communicate with API server during namespace check: Get "https://[IP_ADDRESS]/api/v1/namespaces/kube-system?timeout=45s": context deadline exceeded
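    The "context deadline exceeded" here has the same root cause as the earlier "connection refused": main1 contributes one etcd member and main2 two, so the cluster has three members and a quorum of two, and because the two deleted machines are never removed from the etcd member list, the surviving member keeps waiting for a majority it can never get. A small sketch of that arithmetic, under those assumptions:

```go
package main

import "fmt"

// quorum is the Raft majority an etcd cluster of the given size needs
// in order to keep serving requests.
func quorum(members int) int {
	return members/2 + 1
}

func main() {
	const registered = 3 // main1 (1 etcd member) + main2 (2 etcd members)
	const lost = 2       // deleting the main2 pool destroys both of its VMs

	// Because the deleted machines are never removed from the etcd member
	// list, quorum is still computed over all three registered members.
	healthy := registered - lost
	fmt.Printf("quorum needed: %d, healthy members: %d, has quorum: %v\n",
		quorum(registered), healthy, healthy >= quorum(registered))
	// Output: quorum needed: 2, healthy members: 1, has quorum: false
}
```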

    Validation Environment

  • Create Digital Ocean RKE2 downstream cluster with the following
  • main1 pool - 1 node with etcd and cp roles
  • main2 pool - 2 nodes with etcd and cp roles
  • work1 pool - 1 node with worker role
  • work2 pool - 2 nodes with worker role
  • Wait until the RKE2 cluster has a state of Active
  • Edit Config on the RKE2 Digital Ocean cluster
  • Delete the main2 pool which contains 2 nodes both with cp + etcd roles
  • Press the Save button
  • Result

    No errors and the cluster is still in Active state.
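    During validation, one quick way to confirm the post-deletion topology (illustrative only, assuming a kubeconfig for the downstream cluster and the node-role labels RKE2 normally sets) is to list the remaining nodes and their roles:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder path to a kubeconfig for the downstream RKE2 cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", "./downstream.kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	// After deleting main2, only the main1 node (etcd + control-plane) and the
	// work1/work2 workers should remain.
	for _, n := range nodes.Items {
		fmt.Printf("%s etcd=%s control-plane=%s\n", n.Name,
			n.Labels["node-role.kubernetes.io/etcd"],
			n.Labels["node-role.kubernetes.io/control-plane"])
	}
}
```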

    Labels: kind/bug-qa, QA/XS, release-note