Labels: kind/bug-qa (issues that have not yet hit a real release; bugs introduced by a new feature or enhancement), QA/XS, release-note (note this issue in the milestone's release notes)
**Rancher version:** rancher/rancher:v2.6-head 607fcdd

**Installation option (Docker install/Helm Chart):** Helm

**If Helm Chart, Kubernetes Info:**
- Cluster Type (RKE1, RKE2, k3s, EKS, etc.): RKE (local), DO RKE2 (downstream)
- Version: v1.20.9 (local), v1.21.4+rke2r1 (downstream)
- Node Setup: single-node custom RKE (local) + DO RKE2 (downstream) containing:
  - master1 pool - 1 node with etcd and cp roles
  - master2 pool - 2 nodes with etcd and cp roles
  - worker1 pool - 1 node with worker role
  - worker2 pool - 2 nodes with worker role
**Describe the bug**

Found when testing rancher/dashboard#3834.

When a user has two pools with cp+etcd nodes (plus an extra worker pool) and deletes one of them, the cluster should become Active again after a while. Instead, the cluster becomes unreachable.
**To Reproduce**

1. Deploy a local Rancher cluster.
2. Provision a DO RKE2 cluster with the 4 pools described above, using default values.
3. Wait until the RKE2 cluster is Ready and Active.
4. Go to Cluster Management and choose Edit Config on the DO RKE2 cluster.
5. Delete the master2 pool (containing 2 nodes, both with cp + etcd roles) by selecting the pool and pressing the minus sign, then press the Save button.
**Result**

The RKE2 cluster ends up in an Error state and becomes unreachable from the dashboard, with this error message visible:

```
Cluster health check failed: Failed to communicate with API server during namespace check: Get "https://10.43.0.1:443/api/v1/namespaces/kube-system?timeout=45s": dial tcp 10.43.0.1:443: connect: connection refused
```
**Expected Result**

The cluster should become Active again after a while.
Additional context
the same approach worked okay when user delete a pool with worker nodes as there is still another pool with one worker. The state of the cluster turns to
Provisioning
and then
Active
again after
worker2
pool removal.
the nodes from the deleted pool(s) are getting deleted from DO Cloud correctly
the same issue reproduced on AWS EC2 RKE2 cluster as well
**Screenshot**

This is how it looked before the cluster became unreachable (the worker2 pool with the extra worker nodes had already been deleted successfully):
Looking at this with @kinarashah, we believe that what is happening is that etcd is losing quorum, as there is currently no logic to remove etcd members. It appears that the logic for removing etcd members does not get triggered every time, and the problem is specifically exacerbated in the machine-pool case because the underlying VMs are deleted before the routine has a chance to run.
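The quorum arithmetic makes the failure mode concrete: etcd serves requests only while a majority of its *registered* members is healthy, i.e. floor(n/2) + 1. The sketch below is an illustration of that rule (not Rancher or etcd code), showing why destroying both master2 VMs without first removing their etcd members leaves 1 of 3 registered members alive, below quorum:

```go
package main

import "fmt"

// quorum returns the majority threshold for an etcd cluster of n
// registered members: floor(n/2) + 1.
func quorum(registered int) int { return registered/2 + 1 }

func main() {
	// Before the pool deletion: 3 etcd members
	// (1 in the master1 pool, 2 in master2).
	fmt.Println(quorum(3)) // 2 healthy members required

	// The 2 master2 VMs are destroyed, but because no etcd
	// member removal ran, they stay registered: 3 members on
	// the books, only 1 of them alive.
	alive, registered := 1, 3
	fmt.Println(alive >= quorum(registered)) // false: quorum lost

	// Had the members been removed first, the cluster would have
	// shrunk to 1 registered member, whose quorum is 1.
	fmt.Println(1 >= quorum(1)) // true
}
```

This also explains why deleting the worker2 pool is harmless: worker nodes hold no etcd members, so the quorum calculation never changes.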
*Changed the title from "Failed to communicate with API server when one of the control-plane pools is deleted on RKE2 cluster" to "Failed to communicate with API server when one of the control-plane/etcd pools is deleted on RKE2 cluster" (Aug 27, 2021)*

*Changed the title from "Failed to communicate with API server when one of the control-plane/etcd pools is deleted on RKE2 cluster" to "Failed to communicate with API server when etcd role node is removed" (Aug 27, 2021)*
Create a Digital Ocean RKE2 downstream cluster with the following:

- main1 pool - 1 node with etcd and cp roles
- main2 pool - 2 nodes with etcd and cp roles
- work1 pool - 1 node with worker role
- work2 pool - 2 nodes with worker role

1. Wait until the RKE2 cluster has a state of Active.
2. Edit Config on the RKE2 Digital Ocean cluster.
3. Delete the main2 pool, which contains 2 nodes, both with cp + etcd roles.
4. Press the Save button.
**Result**

The RKE2 cluster eventually goes into an error state with the following warning (see the attached screenshot labeled A1):

```
Cluster health check failed: Failed to communicate with API server during namespace check: Get "https://[IP_ADDRESS]/api/v1/namespaces/kube-system?timeout=45s": context deadline exceeded
```
**Validation Environment**

Create a Digital Ocean RKE2 downstream cluster with the following:

- main1 pool - 1 node with etcd and cp roles
- main2 pool - 2 nodes with etcd and cp roles
- work1 pool - 1 node with worker role
- work2 pool - 2 nodes with worker role

1. Wait until the RKE2 cluster has a state of Active.
2. Edit Config on the RKE2 Digital Ocean cluster.
3. Delete the main2 pool, which contains 2 nodes, both with cp + etcd roles.
4. Press the Save button.

**Result**

No errors, and the cluster remains in the Active state.