Labels: kind/bug-qa (issues that have not yet hit a real release; bugs introduced by a new feature or enhancement), QA/XS, release-note (note this issue in the milestone's release notes)
**Rancher version:** rancher/rancher:v2.6-head 607fcdd

**Installation option (Docker install/Helm Chart):** Helm

**If Helm Chart, Kubernetes Info:**
- Cluster Type (RKE1, RKE2, k3s, EKS, etc.): RKE (local), DO RKE2 (downstream)
- Version: v1.20.9 (local), v1.21.4+rke2r1 (downstream)
- Node Setup: single-node custom RKE (local) + DO RKE2 (downstream) containing:
  - master1 pool - 1 node with etcd and cp roles
  - master2 pool - 2 nodes with etcd and cp roles
  - worker1 pool - 1 node with worker role
  - worker2 pool - 2 nodes with worker role
**Describe the bug**

Found when testing rancher/dashboard#3834.

When a user has two pools with cp+etcd nodes (plus an extra worker pool) and deletes one of them, the cluster should become Active again after a while. Instead, the cluster becomes unreachable.
**To Reproduce**

1. Deploy a local Rancher cluster.
2. Provision a DO RKE2 cluster with the 4 pools described above, using default values.
3. Wait until the RKE2 cluster is Ready and Active.
4. Go to Cluster Management and choose Edit Config on the DO RKE2 cluster.
5. Delete the master2 pool (containing 2 nodes, both with cp + etcd roles) by selecting the pool and pressing the minus sign, then press the Save button.
**Result**

The RKE2 cluster ends up in an Error state and becomes unreachable from the dashboard, with this error message visible:

```
Cluster health check failed: Failed to communicate with API server during namespace check: Get "https://10.43.0.1:443/api/v1/namespaces/kube-system?timeout=45s": dial tcp 10.43.0.1:443: connect: connection refused
```
**Expected Result**

The cluster should become Active again after a while.
Additional context
the same approach worked okay when user delete a pool with worker nodes as there is still another pool with one worker. The state of the cluster turns to
Provisioning
and then
Active
again after
worker2
pool removal.
the nodes from the deleted pool(s) are getting deleted from DO Cloud correctly
the same issue reproduced on AWS EC2 RKE2 cluster as well
**Screenshot**

This is how it looked before the cluster became unreachable (the worker2 pool with the extra worker nodes had already been deleted successfully):
Looking at this with @kinarashah, we believe that what is happening is that etcd is losing quorum, as there is currently no logic to remove etcd members. It appears that the logic for removing etcd members does not get triggered every time, and the problem is specifically exacerbated in the machine-pool case because the underlying VMs are deleted before the routine has a chance to run.
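The quorum arithmetic makes the failure mode concrete: etcd serves requests only while a majority of its *registered* members is healthy, i.e. floor(n/2) + 1. The sketch below is an illustration of that rule (not Rancher or etcd code), showing why destroying both master2 VMs without first removing their etcd members leaves 1 of 3 registered members alive, below quorum:

```go
package main

import "fmt"

// quorum returns the majority threshold for an etcd cluster of n
// registered members: floor(n/2) + 1.
func quorum(registered int) int { return registered/2 + 1 }

func main() {
	// Before the pool deletion: 3 etcd members
	// (1 in the master1 pool, 2 in master2).
	fmt.Println(quorum(3)) // 2 healthy members required

	// The 2 master2 VMs are destroyed, but because no etcd
	// member removal ran, they stay registered: 3 members on
	// the books, only 1 of them alive.
	alive, registered := 1, 3
	fmt.Println(alive >= quorum(registered)) // false: quorum lost

	// Had the members been removed first, the cluster would have
	// shrunk to 1 registered member, whose quorum is 1.
	fmt.Println(1 >= quorum(1)) // true
}
```

This also explains why deleting the worker2 pool is harmless: worker nodes hold no etcd members, so the quorum calculation never changes.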
*Changed the title from "Failed to communicate with API server when one of the control-plane pools is deleted on RKE2 cluster" to "Failed to communicate with API server when one of the control-plane/etcd pools is deleted on RKE2 cluster" (Aug 27, 2021)*

*Changed the title from "Failed to communicate with API server when one of the control-plane/etcd pools is deleted on RKE2 cluster" to "Failed to communicate with API server when etcd role node is removed" (Aug 27, 2021)*
Create a Digital Ocean RKE2 downstream cluster with the following:

- main1 pool - 1 node with etcd and cp roles
- main2 pool - 2 nodes with etcd and cp roles
- work1 pool - 1 node with worker role
- work2 pool - 2 nodes with worker role

1. Wait until the RKE2 cluster has a state of Active.
2. Edit Config on the RKE2 Digital Ocean cluster.
3. Delete the main2 pool, which contains 2 nodes, both with cp + etcd roles.
4. Press the Save button.
**Result**

The RKE2 cluster eventually goes into an error state with the following warning (see the attached screenshot labeled A1):

```
Cluster health check failed: Failed to communicate with API server during namespace check: Get "https://[IP_ADDRESS]/api/v1/namespaces/kube-system?timeout=45s": context deadline exceeded
```
**Validation Environment**

Create a Digital Ocean RKE2 downstream cluster with the following:

- main1 pool - 1 node with etcd and cp roles
- main2 pool - 2 nodes with etcd and cp roles
- work1 pool - 1 node with worker role
- work2 pool - 2 nodes with worker role

1. Wait until the RKE2 cluster has a state of Active.
2. Edit Config on the RKE2 Digital Ocean cluster.
3. Delete the main2 pool, which contains 2 nodes, both with cp + etcd roles.
4. Press the Save button.

**Result**

No errors, and the cluster remains in the Active state.